* Re: 2.6.19 file content corruption on ext3 @ 2006-12-17 0:13 Andrei Popa 2006-12-17 12:06 ` Andrew Morton 0 siblings, 1 reply; 311+ messages in thread From: Andrei Popa @ 2006-12-17 0:13 UTC (permalink / raw) To: Linux Kernel Mailing List Hello, I had filesystem data corruption with rtorrent with 2.6.19. I tried recent git with Peter Zijlstra patch http://lkml.org/lkml/2006/12/16/144 and it seems that the problem is fixed. Please CC as I am not subscribed to lkml. Andrei ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-17 0:13 2.6.19 file content corruption on ext3 Andrei Popa @ 2006-12-17 12:06 ` Andrew Morton 2006-12-17 12:19 ` Marc Haber ` (2 more replies) 0 siblings, 3 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-17 12:06 UTC (permalink / raw) To: andrei.popa Cc: Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Linus Torvalds, Florian Weimer, Marc Haber On Sun, 17 Dec 2006 02:13:18 +0200 Andrei Popa <andrei.popa@i-neo.ro> wrote: > Hello, > I had filesystem data corruption with rtorrent with 2.6.19. > I tried recent git with Peter Zijlstra patch > http://lkml.org/lkml/2006/12/16/144 and it seems that the problem is > fixed. > oh crap, I'd forgotten that test_clear_page_dirty() now fiddles with the ptes. I'd be really surprised if this was all due to a race though. Is everyone who has observed this problem running SMP and/or preemptible kernels? Peter, why isn't that proposed patch's cleaning of the pte racy against do_wp_page()? ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-17 12:06 ` Andrew Morton @ 2006-12-17 12:19 ` Marc Haber 2006-12-17 12:32 ` Andrei Popa 2006-12-17 13:39 ` Andrei Popa 2 siblings, 0 replies; 311+ messages in thread From: Marc Haber @ 2006-12-17 12:19 UTC (permalink / raw) To: Andrew Morton Cc: andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Linus Torvalds, Florian Weimer On Sun, Dec 17, 2006 at 04:06:20AM -0800, Andrew Morton wrote: > I'd be really surprised if this was all due to a race though. Is everyone > who has observed this problem running SMP and/or premptible kernels? Linux torres 2.6.19.1-zgsrv #1 SMP PREEMPT Wed Dec 13 01:31:27 UTC 2006 i686 GNU/Linux So, it's a "yes" to both counts, and I'll build a kernel without SMP and without preemption asap. Greetings Marc -- ----------------------------------------------------------------------------- Marc Haber | "I don't trust Computers. They | Mailadresse im Header Mannheim, Germany | lose things." Winona Ryder | Fon: *49 621 72739834 Nordisch by Nature | How to make an American Quilt | Fax: *49 621 72739835 ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-17 12:06 ` Andrew Morton 2006-12-17 12:19 ` Marc Haber @ 2006-12-17 12:32 ` Andrei Popa 2006-12-17 13:39 ` Andrei Popa 2 siblings, 0 replies; 311+ messages in thread From: Andrei Popa @ 2006-12-17 12:32 UTC (permalink / raw) To: Andrew Morton Cc: Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Linus Torvalds, Florian Weimer, Marc Haber ierdnac ~ # uname -a Linux ierdnac 2.6.20-rc1 #1 SMP PREEMPT Sun Dec 17 01:52:28 EET 2006 i686 Genuine Intel(R) CPU T2050 @ 1.60GHz GenuineIntel GNU/Linux On Sun, 2006-12-17 at 04:06 -0800, Andrew Morton wrote: > On Sun, 17 Dec 2006 02:13:18 +0200 > Andrei Popa <andrei.popa@i-neo.ro> wrote: > > > Hello, > > I had filesystem data corruption with rtorrent with 2.6.19. > > I tried recent git with Peter Zijlstra patch > > http://lkml.org/lkml/2006/12/16/144 and it seems that the problem is > > fixed. > > > > oh crap, I'd forgotten that test_clear_page_dirty() now fiddles with the > ptes. > > I'd be really surprised if this was all due to a race though. Is everyone > who has observed this problem running SMP and/or premptible kernels? > > Peter, why isn't that proposed patch's cleaning of the pte racy against > do_wp_page()? ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-17 12:06 ` Andrew Morton 2006-12-17 12:19 ` Marc Haber 2006-12-17 12:32 ` Andrei Popa @ 2006-12-17 13:39 ` Andrei Popa 2006-12-17 23:40 ` Andrew Morton 2 siblings, 1 reply; 311+ messages in thread From: Andrei Popa @ 2006-12-17 13:39 UTC (permalink / raw) To: Andrew Morton Cc: Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Linus Torvalds, Florian Weimer, Marc Haber I was mistaken, I'm still having file corruption with rtorrent. On Sun, 2006-12-17 at 04:06 -0800, Andrew Morton wrote: > On Sun, 17 Dec 2006 02:13:18 +0200 > Andrei Popa <andrei.popa@i-neo.ro> wrote: > > > Hello, > > I had filesystem data corruption with rtorrent with 2.6.19. > > I tried recent git with Peter Zijlstra patch > > http://lkml.org/lkml/2006/12/16/144 and it seems that the problem is > > fixed. > > > > oh crap, I'd forgotten that test_clear_page_dirty() now fiddles with the > ptes. > > I'd be really surprised if this was all due to a race though. Is everyone > who has observed this problem running SMP and/or premptible kernels? > > Peter, why isn't that proposed patch's cleaning of the pte racy against > do_wp_page()? ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-17 13:39 ` Andrei Popa @ 2006-12-17 23:40 ` Andrew Morton 2006-12-18 1:02 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-17 23:40 UTC (permalink / raw) To: andrei.popa Cc: Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Linus Torvalds, Florian Weimer, Marc Haber, Martin Michlmayr On Sun, 17 Dec 2006 15:39:32 +0200 Andrei Popa <andrei.popa@i-neo.ro> wrote: > I was mistaken, I'm still having file corruption with rtorrent. > Well I'm not very optimistic, but if people could try this, please... From: Andrew Morton <akpm@osdl.org> try_to_free_buffers() clears the page's dirty state if it successfully removed the page's buffers. Background for this: - a process does a one-byte-write to a file on a 64k pagesize, 4k blocksize ext3 filesystem. The page is now PageDirty, !PageUptodate and has one dirty buffer and 15 not uptodate buffers. - kjournald writes the dirty buffer. The page is now PageDirty, !PageUptodate and has a mix of clean and not uptodate buffers. - try_to_free_buffers() removes the page's buffers. It MUST now clear PageDirty. If we were to leave the page dirty then we'd have a dirty, not uptodate page with no buffer_heads. We're screwed: we cannot write the page because we don't know which sections of it contain garbage. We cannot read the page because we don't know which sections of it contain modified data. We cannot free the page because it is dirty. Peter's "mm: tracking shared dirty pages" (d08b3851da41d0ee60851f2c75b118e1f7a5fc89) modified clear_page_dirty() so that it also clears the page's pte mapping's dirty flags, arranging for a subsequent userspace modification of the page to cause a fault. That change to clear_page_dirty() was correct for when it is called on the writeback path.
Here, we effectively do: ClearPageDirty() pte_mkclean() submit-the-writeout if a page-dirtying via write() or via pte's happens after the ClearPageDirty() or the pte_mkclean() then the page is redirtied while writeout is in flight and the page will again need writing; no probs. But that change to clear_page_dirty() was incorrect for when it is called on the try_to_free_buffers() path. Here, we want to preserve any pte-dirtiness because we're not going to write the page to backing store. We need to keep a record of any userspace modification to the page. One way of addressing this would be to bale from try_to_free_buffers() if the page is mapped into pagetables. However that is racy, because the pagefault path doesn't lock the page when establishing a pte against it (I wish it did - it would solve a lot of nasties). So this patch instead arranges for clear_page_dirty() to not clean the pte's when it is called on the try_to_free_buffers() path. clear_page_dirty() had several callers and it's not immediately obvious to me what the appropriate behaviour is in each case. Could maintainers please take a look? From my quick reading, all callers of try_to_free_buffers() have already unmapped the page from pagetables, and given that the reported ext3 corruption happens on uniprocessor, non-preempt kernels, I doubt if this patch will fix things. But even if it is true that try_to_free_buffers() callers unmap the page first, this fix is still needed, because a minor fault could reestablish pte's in the meanwhile. Note that with this change, we can now restore try_to_free_buffers()'s ->private_lock to cover the test_clear_page_dirty(). If we indeed need to do that, it'll be in a separate patch. (Need to think about this some more. How can a page be pte-dirty, but not have dirty buffers? We're supposed to clean the pte's when we write the page, and we dirty the page and buffers when userspace dirties the pte...)
Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: <reiserfs-dev@namesys.com> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: David Chinner <dgc@sgi.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> --- fs/buffer.c | 2 +- fs/cifs/file.c | 2 +- fs/fuse/file.c | 2 +- fs/hugetlbfs/inode.c | 2 +- fs/jfs/jfs_metapage.c | 2 +- fs/reiserfs/stree.c | 2 +- fs/xfs/linux-2.6/xfs_aops.c | 2 +- include/linux/page-flags.h | 6 +++--- mm/page-writeback.c | 5 +++-- mm/truncate.c | 4 ++-- 10 files changed, 15 insertions(+), 14 deletions(-) diff -puN fs/buffer.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/buffer.c --- a/fs/buffer.c~try_to_free_buffers-dont-clear-pte-dirty-bits +++ a/fs/buffer.c @@ -2858,7 +2858,7 @@ int try_to_free_buffers(struct page *pag * the page's buffers clean. We discover that here and clean * the page also. */ - if (test_clear_page_dirty(page)) + if (test_clear_page_dirty(page, 0)) task_io_account_cancelled_write(PAGE_CACHE_SIZE); } out: diff -puN fs/fuse/file.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/fuse/file.c --- a/fs/fuse/file.c~try_to_free_buffers-dont-clear-pte-dirty-bits +++ a/fs/fuse/file.c @@ -484,7 +484,7 @@ static int fuse_commit_write(struct file spin_unlock(&fc->lock); if (offset == 0 && to == PAGE_CACHE_SIZE) { - clear_page_dirty(page); + clear_page_dirty(page, 0); SetPageUptodate(page); } } diff -puN fs/hugetlbfs/inode.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/hugetlbfs/inode.c --- a/fs/hugetlbfs/inode.c~try_to_free_buffers-dont-clear-pte-dirty-bits +++ a/fs/hugetlbfs/inode.c @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + clear_page_dirty(page, 1); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff -puN fs/jfs/jfs_metapage.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/jfs/jfs_metapage.c --- 
a/fs/jfs/jfs_metapage.c~try_to_free_buffers-dont-clear-pte-dirty-bits +++ a/fs/jfs/jfs_metapage.c @@ -773,7 +773,7 @@ void release_metapage(struct metapage * /* Retest mp->count since we may have released page lock */ if (test_bit(META_discard, &mp->flag) && !mp->count) { - clear_page_dirty(page); + clear_page_dirty(page, 1); ClearPageUptodate(page); } #else diff -puN fs/reiserfs/stree.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/reiserfs/stree.c --- a/fs/reiserfs/stree.c~try_to_free_buffers-dont-clear-pte-dirty-bits +++ a/fs/reiserfs/stree.c @@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p bh = next; } while (bh != head); if (PAGE_SIZE == bh->b_size) { - clear_page_dirty(page); + clear_page_dirty(page, 0); } } } diff -puN fs/xfs/linux-2.6/xfs_aops.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/xfs/linux-2.6/xfs_aops.c --- a/fs/xfs/linux-2.6/xfs_aops.c~try_to_free_buffers-dont-clear-pte-dirty-bits +++ a/fs/xfs/linux-2.6/xfs_aops.c @@ -343,7 +343,7 @@ xfs_start_page_writeback( ASSERT(!PageWriteback(page)); set_page_writeback(page); if (clear_dirty) - clear_page_dirty(page); + clear_page_dirty(page, 1); unlock_page(page); if (!buffers) { end_page_writeback(page); diff -puN include/linux/page-flags.h~try_to_free_buffers-dont-clear-pte-dirty-bits include/linux/page-flags.h --- a/include/linux/page-flags.h~try_to_free_buffers-dont-clear-pte-dirty-bits +++ a/include/linux/page-flags.h @@ -253,13 +253,13 @@ static inline void SetPageUptodate(struc struct page; /* forward declaration */ -int test_clear_page_dirty(struct page *page); +int test_clear_page_dirty(struct page *page, int must_clean_ptes); int test_clear_page_writeback(struct page *page); int test_set_page_writeback(struct page *page); -static inline void clear_page_dirty(struct page *page) +static inline void clear_page_dirty(struct page *page, int must_clean_ptes) { - test_clear_page_dirty(page); + test_clear_page_dirty(page, must_clean_ptes); } static inline void 
set_page_writeback(struct page *page) diff -puN mm/page-writeback.c~try_to_free_buffers-dont-clear-pte-dirty-bits mm/page-writeback.c --- a/mm/page-writeback.c~try_to_free_buffers-dont-clear-pte-dirty-bits +++ a/mm/page-writeback.c @@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock); * Clear a page's dirty flag, while caring for dirty memory accounting. * Returns true if the page was previously dirty. */ -int test_clear_page_dirty(struct page *page) +int test_clear_page_dirty(struct page *page, int must_clean_ptes) { struct address_space *mapping = page_mapping(page); unsigned long flags; @@ -866,7 +866,8 @@ int test_clear_page_dirty(struct page *p * page is locked, which pins the address_space */ if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); + if (must_clean_ptes) + page_mkclean(page); dec_zone_page_state(page, NR_FILE_DIRTY); } return 1; diff -puN mm/truncate.c~try_to_free_buffers-dont-clear-pte-dirty-bits mm/truncate.c --- a/mm/truncate.c~try_to_free_buffers-dont-clear-pte-dirty-bits +++ a/mm/truncate.c @@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp if (PagePrivate(page)) do_invalidatepage(page, 0); - if (test_clear_page_dirty(page)) + if (test_clear_page_dirty(page, 1)) task_io_account_cancelled_write(PAGE_CACHE_SIZE); ClearPageUptodate(page); ClearPageMappedToDisk(page); @@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct PAGE_CACHE_SIZE, 0); } } - was_dirty = test_clear_page_dirty(page); + was_dirty = test_clear_page_dirty(page, 0); if (!invalidate_complete_page2(mapping, page)) { if (was_dirty) set_page_dirty(page); diff -puN fs/cifs/file.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/cifs/file.c --- a/fs/cifs/file.c~try_to_free_buffers-dont-clear-pte-dirty-bits +++ a/fs/cifs/file.c @@ -1245,7 +1245,7 @@ retry: wait_on_page_writeback(page); if (PageWriteback(page) || - !test_clear_page_dirty(page)) { + !test_clear_page_dirty(page, 1)) { unlock_page(page); break; } _ ^ permalink raw reply [flat|nested] 311+ 
messages in thread
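[ Editorial sketch, not part of the original posting. ] The background scenario in the patch description above (64k page, 4k blocks, a one-byte write, kjournald, then try_to_free_buffers()) can be walked through with a small user-space model. Everything here is an assumption made for illustration: struct toy_buffered_page and the toy_* functions are hypothetical names, not kernel code, and the buffer_head state is reduced to a three-value enum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE  (64 * 1024)
#define TOY_BLOCK_SIZE (4 * 1024)
#define TOY_NR_BUFFERS (TOY_PAGE_SIZE / TOY_BLOCK_SIZE) /* 16 buffers per page */

/* Zero-initialised buffers start out "not uptodate". */
enum toy_bh_state { TOY_BH_NOT_UPTODATE, TOY_BH_CLEAN, TOY_BH_DIRTY };

/* Toy model only: this is not the kernel's struct page. */
struct toy_buffered_page {
    bool dirty;       /* stands in for PageDirty */
    bool uptodate;    /* stands in for PageUptodate */
    bool has_buffers; /* buffer_heads are attached */
    enum toy_bh_state bh[TOY_NR_BUFFERS];
};

/* A one-byte write() at offset pos inside the page: one buffer and the
 * page become dirty, the page itself stays !uptodate. */
static void toy_write_one_byte(struct toy_buffered_page *page, size_t pos)
{
    assert(pos < TOY_PAGE_SIZE);
    page->has_buffers = true;
    page->bh[pos / TOY_BLOCK_SIZE] = TOY_BH_DIRTY;
    page->dirty = true;
}

/* kjournald writes the dirty buffers; it does not touch PageDirty. */
static void toy_journal_writeout(struct toy_buffered_page *page)
{
    for (size_t i = 0; i < TOY_NR_BUFFERS; i++)
        if (page->bh[i] == TOY_BH_DIRTY)
            page->bh[i] = TOY_BH_CLEAN;
}

/* Returns true if the buffers were removed. On success the page's dirty
 * state is cleared too, which is the step the description says MUST happen. */
static bool toy_try_to_free_buffers(struct toy_buffered_page *page)
{
    for (size_t i = 0; i < TOY_NR_BUFFERS; i++)
        if (page->bh[i] == TOY_BH_DIRTY)
            return false;
    page->has_buffers = false;
    page->dirty = false;
    return true;
}

/* Dirty, not uptodate and without buffers: the state that can be
 * neither written, read, nor freed. */
static bool toy_page_is_stuck(const struct toy_buffered_page *page)
{
    return page->dirty && !page->uptodate && !page->has_buffers;
}
```

In this model, toy_try_to_free_buffers() fails while a buffer is still dirty and succeeds after toy_journal_writeout(); removing the "page->dirty = false" line would leave toy_page_is_stuck() true, which is the situation the description calls "screwed".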
* Re: 2.6.19 file content corruption on ext3 2006-12-17 23:40 ` Andrew Morton @ 2006-12-18 1:02 ` Linus Torvalds 2006-12-18 1:22 ` Linus Torvalds 2006-12-18 16:55 ` Peter Zijlstra 2 siblings, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-18 1:02 UTC (permalink / raw) To: Andrew Morton Cc: andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Sun, 17 Dec 2006, Andrew Morton wrote: > > So this patch instead arranges for clear_page_dirty() to not clean the pte's > when it is called on the try_to_free_buffers() path. No. This is wrong. It's wrong exactly because it now _breaks_ the whole thing that the 2.6.19 PG_dirty changes were all about: keeping track of dirty pages. Now you have a page that is dirty, but it's no longer marked PG_dirty, and thus it doesn't participate in the dirty accounting. > From my quick reading, all callers of try_to_free_buffers() have already > unmapped the page from pagetables, and given that the reported ext3 corruption > happens on uniprocessor, non-preempt kernels, I doubt if this patch will fix > things. So not only are you breaking this, you also claim that it cannot happen in the first place. So either the patch is buggy, or it's pointless. In neither case does it seem to be a good idea to do. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
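[ Editorial sketch, not part of the original posting. ] The objection above can be made concrete with a small user-space model of the must_clean_ptes argument. This is only a sketch under stated assumptions: struct toy_page and the toy_* functions are hypothetical names invented here, and PG_dirty, the pte dirty bit and the NR_FILE_DIRTY accounting are reduced to plain fields.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these are the kernel's real structures. */
struct toy_page {
    bool pg_dirty;      /* stands in for PG_dirty */
    bool pte_dirty;     /* stands in for a dirty pte mapping the page */
    int  nr_file_dirty; /* stands in for this page's share of NR_FILE_DIRTY */
};

/* Userspace dirties the page through its mapping; the model then
 * propagates that to PG_dirty and to the dirty accounting. */
static void toy_dirty_via_pte(struct toy_page *page)
{
    page->pte_dirty = true;
    if (!page->pg_dirty) {
        page->pg_dirty = true;
        page->nr_file_dirty++;
    }
}

/* Models test_clear_page_dirty(page, must_clean_ptes) as changed by the
 * patch. Returns true if the page was previously dirty. */
static bool toy_test_clear_page_dirty(struct toy_page *page, bool must_clean_ptes)
{
    if (!page->pg_dirty)
        return false;
    page->pg_dirty = false;
    if (must_clean_ptes)
        page->pte_dirty = false; /* stands in for page_mkclean() */
    page->nr_file_dirty--;       /* accounting drops in both cases */
    assert(page->nr_file_dirty >= 0);
    return true;
}

/* The state objected to: modified through a pte, yet no longer marked
 * PG_dirty and so invisible to the dirty accounting. */
static bool toy_untracked_dirty(const struct toy_page *page)
{
    return page->pte_dirty && !page->pg_dirty;
}
```

With must_clean_ptes set, a page dirtied via toy_dirty_via_pte() ends up fully clean; with it cleared, toy_untracked_dirty() reports true while nr_file_dirty has already dropped to zero, which is the "dirty, but no longer marked PG_dirty" case described in the message.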
* Re: 2.6.19 file content corruption on ext3 2006-12-17 23:40 ` Andrew Morton 2006-12-18 1:02 ` Linus Torvalds @ 2006-12-18 1:22 ` Linus Torvalds 2006-12-18 1:29 ` Linus Torvalds 2006-12-18 16:55 ` Peter Zijlstra 2 siblings, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-18 1:22 UTC (permalink / raw) To: Andrew Morton Cc: andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Sun, 17 Dec 2006, Andrew Morton wrote: > > From my quick reading, all callers of try_to_free_buffers() have already > unmapped the page from pagetables, and given that the reported ext3 corruption > happens on uniprocessor, non-preempt kernels, I doubt if this patch will fix > things. Hmm. One possible explanation: maybe the page actually _did_ get unmapped from the page tables, but got added back? I don't think we lock the page when faulting it in (we want it to be uptodate, but not necessarily locked). So assuming the pageout sequence always _does_ follow the rule that it only does try_to_free_buffers() on pages that aren't mapped, what actually protects them from not becoming mapped (and dirtied) during that sequence? So we should probably do a "wait_for_page()" in do_no_page()? Or maybe only do it for write accesses (since we don't really care about getting mapped readably)? If so, we need to do it in the write case of do_no_page() _and_ in the do_wp_page() case. Hmm? Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 1:22 ` Linus Torvalds @ 2006-12-18 1:29 ` Linus Torvalds 2006-12-18 1:57 ` Linus Torvalds 0 siblings, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-18 1:29 UTC (permalink / raw) To: Andrew Morton Cc: andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Sun, 17 Dec 2006, Linus Torvalds wrote: > > So we should probably do a "wait_for_page()" in do_no_page()? > > Or maybe only do it for write accesses (since we don't really care about > getting mapped readably)? If so, we need to do it in the write case of > do_no_page() _and_ in the do_wp_page() case. Hmm? I think we discussed doing exactly this at some earlier time, actually, just to try to throttle people who do lots of page dirtying. Maybe we even do it somewhere, but I tried to see it, and in the normal "nopage()" routine we very much try to _avoid_ locking the page (ie if it's marked PageUptodate() we'll return it whether locked or not). Which is fine - especially for readers, there really isn't any reason to ever delay them getting access to a page just because it's locked for write-out or something (once it's mapped, they'll have access to it regardless of any locked state in the kernel anyway). So I don't actually see any serialization at all that would keep a random page from being paged back in. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 1:29 ` Linus Torvalds @ 2006-12-18 1:57 ` Linus Torvalds 2006-12-18 4:51 ` Nick Piggin 0 siblings, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-18 1:57 UTC (permalink / raw) To: Andrew Morton Cc: andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr [ Replying to myself - a sure sign that I don't get out enough ] On Sun, 17 Dec 2006, Linus Torvalds wrote: > > So I don't actually see any serialization at all that would keep a random > page from being paged back in. We do actually serialize, but we do it _after_ the page has already been mapped back. Ie we do it for the dirty page case at the end of do_wp_page() and do_no_page() when we do the "set_page_dirty_balance()", but that's potentially too late - we've already mapped the page read-write into the address space. That said, this means that only threaded apps should ever trigger any problems, which would seem to make it unlikely that this is the issue. But Andrew: I don't think it's necessarily true that "try_to_free_buffers()" callers have all unmapped the page. That seems to be true for vmscan.c (ie the shrink_page_list -> try_to_release_page -> try_to_release_buffers callchain), but what about the other callchains (through filesystems, or through "pagevec_strip()" or similar?) That pagevec_strip() is called from shrink_active_list(), I don't see that unmapping the pages.. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 1:57 ` Linus Torvalds @ 2006-12-18 4:51 ` Nick Piggin 2006-12-18 5:43 ` Andrew Morton 2006-12-18 5:50 ` Linus Torvalds 0 siblings, 2 replies; 311+ messages in thread From: Nick Piggin @ 2006-12-18 4:51 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr Linus Torvalds wrote: > [ Replying to myself - a sure sign that I don't get out enough ] > > On Sun, 17 Dec 2006, Linus Torvalds wrote: > >>So I don't actually see any serialization at all that would keep a random >>page from being paged back in. > > > We do actually serialize, but we do it _after_ the page has already been > mapped back. Ie we do it for the dirty page case at rthe end of > do_wp_page() and do_no_page() when we do the "set_page_dirty_balance()", > but that's potentially too late - we've already mapped the page read-write > into the address space. I can't see how that's exactly a problem -- so long as the page does not get reclaimed (it won't, because we have a ref on it) then all that matters is that the page eventually gets marked dirty. > That said, this means that only threaded apps should ever trigger any > problems, which would seem to make it unlikely that this is the issue. > > But Andrew: I don't think it's necessarily true that > "try_to_free_buffers()" callers have all unmapped the page. > > That seems to be true for vmscan.c (ie the shrink_page_list -> > try_to_release_page -> try_to_release_buffers callchain), but what about > the other callchains (through filesystems, or through "pagevec_strip()" or > similar?) That pagevec_strip() is called from shrink_active_list(), I > don't see that unmapping the pages.. Right. But would it really matter whether they are currently mapped or not, given that we agree it may become mapped at any point? I think the problem Andrew identified is real. 
The issue is the disconnect between the pte dirtiness and a filesystem bringing buffers clean. But I disagree with his fix, because we don't actually want to just throw out that pte dirtiness information: we're just trying to get the PG_dirty bit into synch with what the buffers are telling us, not actually clean or dirty anything, as such. Can we clear the page dirty bit, then run set_page_dirty afterwards, if any dirty ptes are found? The other thing we might be able to do is to skip doing the clear_page_dirty if the page is uptodate. This feels more hackish but it might be faster? -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 4:51 ` Nick Piggin @ 2006-12-18 5:43 ` Andrew Morton 2006-12-18 7:22 ` Nick Piggin 2006-12-19 8:51 ` Marc Haber 2006-12-18 5:50 ` Linus Torvalds 1 sibling, 2 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-18 5:43 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 18 Dec 2006 15:51:52 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > I think the problem Andrew identified is real. I don't. In fact I don't think I described any problem (well, I tried to, but then I contradicted myself). Six hours here of fsx-linux plus high memory pressure on SMP on 1k blocksize ext3, mainline. Zero failures. It's unlikely that this testing would pass, yet people running normal workloads are able to easily trigger failures. I suspect we're looking in the wrong place. > The issue is the disconnect between the pte dirtiness and a filesystem > bringing buffers clean. Really? The dirtying direction goes pte_dirty->PG_dirty->BH_Dirty and the cleaning direction goes !BH_Dirty->!PG_dirty->!pte_dirty. That's pretty simple, setting aside races. In the try_to_free_buffers case there's a large time interval between !BH_Dirty and !PG_dirty, but that shouldn't affect anything. I don't think we even have a theory as to what's gone wrong yet. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 5:43 ` Andrew Morton @ 2006-12-18 7:22 ` Nick Piggin 2006-12-18 9:18 ` Andrew Morton 2006-12-19 8:51 ` Marc Haber 1 sibling, 1 reply; 311+ messages in thread From: Nick Piggin @ 2006-12-18 7:22 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr Andrew Morton wrote: > On Mon, 18 Dec 2006 15:51:52 +1100 > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > >>I think the problem Andrew identified is real. > > > I don't. In fact I don't think I described any problem (well, I tried to, > but then I contradicted myself). By saying that there shouldn't be any dirty ptes if there are no dirty buffers? But in that case the _page_ shouldn't be dirty either, so that clear_page_dirty would be redundant. But presumably it isn't. > Six hours here of fsx-linux plus high memory pressure on SMP on 1k > blocksize ext3, mainline. Zero failures. It's unlikely that this testing > would pass, yet people running normal workloads are able to easily trigger > failures. I suspect we're looking in the wrong place. Yes I could believe it the corruption is caused by something else completely. >>The issue is the disconnect between the pte dirtiness and a filesystem >>bringing buffers clean. > > > Really? The dirtying direction goes pte_dirty->PG_dirty->BH_Dirty and the > cleaning direction goes !BH_Dirty->!PG_dirty->!pte_dirty. That's pretty > simple, setting aside races. > > In the try_to_free_buffers case there's a large time inverval between > !BH_Dirty and !PG_dirty, but that shouldn't affect anything. After try_to_free_buffers detaches the buffers from the page, a pagefault can come in, and mark the pte writeable, then set_page_dirty (which finds no buffers, so only sets PG_dirty). The page can now get dirtied through this mapping. try_to_free_buffers then goes on to clean the page and ptes. 
I really thought you were the one who identified this race, and I didn't see where you showed it is safe. It may be very unlikely with small SMPs, but less so with preempt. All we have to do is preempt at spin_unlock in try_to_free_buffers AFAIKS. Were you testing with preempt? -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 7:22 ` Nick Piggin @ 2006-12-18 9:18 ` Andrew Morton 2006-12-18 9:26 ` Andrei Popa 2006-12-18 9:42 ` Nick Piggin 0 siblings, 2 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-18 9:18 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 18 Dec 2006 18:22:42 +1100 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Andrew Morton wrote: > > On Mon, 18 Dec 2006 15:51:52 +1100 > > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > > > >>I think the problem Andrew identified is real. > > > > > > I don't. In fact I don't think I described any problem (well, I tried to, > > but then I contradicted myself). > > By saying that there shouldn't be any dirty ptes if there are no > dirty buffers? But in that case the _page_ shouldn't be dirty either, > so that clear_page_dirty would be redundant. But presumably it isn't. I don't follow that. The linkage between pte-dirtiness and buffer_heads is a bit hard to follow without also considering page-dirtiness. > > Six hours here of fsx-linux plus high memory pressure on SMP on 1k > > blocksize ext3, mainline. Zero failures. It's unlikely that this testing > > would pass, yet people running normal workloads are able to easily trigger > > failures. I suspect we're looking in the wrong place. > > Yes I could believe it the corruption is caused by something else > completely. Think so. We do have a problem here, but only on threaded apps, I believe. rtorrent doesn't appear to be threaded, and the bug is hit on non-preempt UP. > >>The issue is the disconnect between the pte dirtiness and a filesystem > >>bringing buffers clean. > > > > > > Really? The dirtying direction goes pte_dirty->PG_dirty->BH_Dirty and the > > cleaning direction goes !BH_Dirty->!PG_dirty->!pte_dirty. That's pretty > > simple, setting aside races. 
> > > > In the try_to_free_buffers case there's a large time inverval between > > !BH_Dirty and !PG_dirty, but that shouldn't affect anything. > > After try_to_free_buffers detaches the buffers from the page, a > pagefault can come in, and mark the pte writeable, then set_page_dirty > (which finds no buffers, so only sets PG_dirty). > > The page can now get dirtied through this mapping. > > try_to_free_buffers then goes on to clean the page and ptes. try_to_free_buffers() isn't called against a page which doesn't have buffers. It'll oops. > Were you testing with preempt? nope, just SMP. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 9:18 ` Andrew Morton @ 2006-12-18 9:26 ` Andrei Popa 2006-12-18 9:42 ` Nick Piggin 1 sibling, 0 replies; 311+ messages in thread From: Andrei Popa @ 2006-12-18 9:26 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, Linus Torvalds, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 2006-12-18 at 01:18 -0800, Andrew Morton wrote: > On Mon, 18 Dec 2006 18:22:42 +1100 > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > Andrew Morton wrote: > > > On Mon, 18 Dec 2006 15:51:52 +1100 > > > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > > > > > > >>I think the problem Andrew identified is real. > > > > > > > > > I don't. In fact I don't think I described any problem (well, I tried to, > > > but then I contradicted myself). > > > > By saying that there shouldn't be any dirty ptes if there are no > > dirty buffers? But in that case the _page_ shouldn't be dirty either, > > so that clear_page_dirty would be redundant. But presumably it isn't. > > I don't follow that. > > The linkage between pte-dirtiness and buffer_heads is a bit hard to follow > without also considering page-dirtiness. > > > > Six hours here of fsx-linux plus high memory pressure on SMP on 1k > > > blocksize ext3, mainline. Zero failures. It's unlikely that this testing > > > would pass, yet people running normal workloads are able to easily trigger > > > failures. I suspect we're looking in the wrong place. > > > > Yes I could believe it the corruption is caused by something else > > completely. > > Think so. We do have a problem here, but only on threaded apps, I believe. > rtorrent doesn't appear to be threaded, and the bug is hit on non-preempt > UP. ierdnac ~ # uname -a Linux ierdnac 2.6.20-rc1 #2 SMP PREEMPT Mon Dec 18 11:01:52 EET 2006 i686 Genuine Intel(R) CPU T2050 @ 1.60GHz GenuineIntel GNU/Linux and the other person who had corruption with rtorrent has also SMP and PREEMPT. 
> > > >>The issue is the disconnect between the pte dirtiness and a filesystem > > >>bringing buffers clean. > > > > > > > > > Really? The dirtying direction goes pte_dirty->PG_dirty->BH_Dirty and the > > > cleaning direction goes !BH_Dirty->!PG_dirty->!pte_dirty. That's pretty > > > simple, setting aside races. > > > > > > In the try_to_free_buffers case there's a large time interval between > > > !BH_Dirty and !PG_dirty, but that shouldn't affect anything. > > > > After try_to_free_buffers detaches the buffers from the page, a > > pagefault can come in, and mark the pte writeable, then set_page_dirty > > (which finds no buffers, so only sets PG_dirty). > > > > The page can now get dirtied through this mapping. > > > > try_to_free_buffers then goes on to clean the page and ptes. > > try_to_free_buffers() isn't called against a page which doesn't have > buffers. It'll oops. > > > Were you testing with preempt? > > nope, just SMP. > ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 9:18 ` Andrew Morton 2006-12-18 9:26 ` Andrei Popa @ 2006-12-18 9:42 ` Nick Piggin 1 sibling, 0 replies; 311+ messages in thread From: Nick Piggin @ 2006-12-18 9:42 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr Andrew Morton wrote: > On Mon, 18 Dec 2006 18:22:42 +1100 > Nick Piggin <nickpiggin@yahoo.com.au> wrote: >>Yes I could believe it the corruption is caused by something else >>completely. > > > Think so. We do have a problem here, but only on threaded apps, I believe. > rtorrent doesn't appear to be threaded, and the bug is hit on non-preempt > UP. I think (see below) that it does not apply only to threaded apps. But it would need one of SMP or PREEMPT to trigger. >>After try_to_free_buffers detaches the buffers from the page, a >>pagefault can come in, and mark the pte writeable, then set_page_dirty >>(which finds no buffers, so only sets PG_dirty). >> >>The page can now get dirtied through this mapping. >> >>try_to_free_buffers then goes on to clean the page and ptes. > > > try_to_free_buffers() isn't called against a page which doesn't have > buffers. It'll oops. Sure. But I think the race exists... I'll try spelling it out in the conventional way:

try_to_free_buffers()
  drop_buffers() (succeeds)
      ** preempt here or run right-hand thread on 2nd CPU in SMP **
                                        do_no_page()
                                          set_page_dirty()
                                        [now modify the page via this mapping
                                         (from this process or a concurrent thread)]
  clear_page_dirty() (clears PG_dirty + pte dirty, oops)

-- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 311+ messages in thread
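The interleaving Nick spells out above can be replayed as a toy, single-threaded model. This is only a sketch under stated assumptions: `struct toy_page` and every `toy_*` function are hypothetical stand-ins invented for illustration, not kernel code, and the model only tracks the flags named in his message (buffers attached, PG_dirty, pte dirty). It shows why the ordering matters: if the fault and the write land between drop_buffers() and clear_page_dirty(), the page ends up holding modified data with no dirty state left to trigger writeback.

```c
#include <stdbool.h>

/* Toy stand-in for a page-cache page; NOT the kernel's struct page. */
struct toy_page {
	bool has_buffers;	/* buffer_heads still attached */
	bool pg_dirty;		/* models PG_dirty */
	bool pte_dirty;		/* models the dirty bit in the mapping pte */
	bool unwritten;		/* memory holds data that never reached disk */
};

/* Left-hand side, first step: detach the buffers (assumed to succeed). */
static void toy_drop_buffers(struct toy_page *p)
{
	p->has_buffers = false;
}

/*
 * Right-hand side: a fault maps the page writable, set_page_dirty()
 * finds no buffers and only sets PG_dirty, then the page is modified
 * through the mapping.
 */
static void toy_fault_and_write(struct toy_page *p)
{
	p->pg_dirty = true;
	p->pte_dirty = true;
	p->unwritten = true;
}

/* Left-hand side, last step: clears PG_dirty and the pte dirty bit. */
static void toy_clear_page_dirty(struct toy_page *p)
{
	p->pg_dirty = false;
	p->pte_dirty = false;
}

/* True if modified data is left behind with no dirty state at all. */
static bool toy_data_lost(const struct toy_page *p)
{
	return p->unwritten && !p->pg_dirty && !p->pte_dirty;
}

/* The racy order from the diagram: the fault lands in the window. */
static bool toy_replay_race(void)
{
	struct toy_page p = { .has_buffers = true };

	toy_drop_buffers(&p);
	toy_fault_and_write(&p);	/* preempted or second CPU */
	toy_clear_page_dirty(&p);
	return toy_data_lost(&p);
}

/* The same steps with the fault arriving only after the cleaning. */
static bool toy_replay_serialized(void)
{
	struct toy_page p = { .has_buffers = true };

	toy_drop_buffers(&p);
	toy_clear_page_dirty(&p);
	toy_fault_and_write(&p);
	return toy_data_lost(&p);
}
```

In this model only the racy order reports lost data; the serialized order leaves both dirty flags set, so a later writeback would still pick the page up.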
* Re: 2.6.19 file content corruption on ext3 2006-12-18 5:43 ` Andrew Morton 2006-12-18 7:22 ` Nick Piggin @ 2006-12-19 8:51 ` Marc Haber 2006-12-19 9:28 ` Martin Michlmayr 2006-12-28 18:05 ` Marc Haber 1 sibling, 2 replies; 311+ messages in thread From: Marc Haber @ 2006-12-19 8:51 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, Linus Torvalds, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Martin Michlmayr On Sun, Dec 17, 2006 at 09:43:08PM -0800, Andrew Morton wrote: > Six hours here of fsx-linux plus high memory pressure on SMP on 1k > blocksize ext3, mainline. Zero failures. It's unlikely that this testing > would pass, yet people running normal workloads are able to easily trigger > failures. I suspect we're looking in the wrong place. I do not have a clue about memory management at all, but is it possible that you're testing on a box with too much memory? My box has only 256 MB, and I used to use mutt with a _huge_ inbox with mutt taking somewhat 150 MB. Add spamassassin and a reasonably busy mail server, and the box used to be like 150 MB in swap. I have tidied my inbox in the mean time and mutt's memory requirement has been reduced to somewhat 30 MB, which might be the cause that I don't see the issue that often any more. Greetings Marc, just trying to give input -- ----------------------------------------------------------------------------- Marc Haber | "I don't trust Computers. They | Mailadresse im Header Mannheim, Germany | lose things." Winona Ryder | Fon: *49 621 72739834 Nordisch by Nature | How to make an American Quilt | Fax: *49 621 72739835 ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 8:51 ` Marc Haber @ 2006-12-19 9:28 ` Martin Michlmayr 2006-12-28 18:05 ` Marc Haber 1 sibling, 0 replies; 311+ messages in thread From: Martin Michlmayr @ 2006-12-19 9:28 UTC (permalink / raw) To: Marc Haber Cc: Andrew Morton, Nick Piggin, Linus Torvalds, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer * Marc Haber <mh+linux-kernel@zugschlus.de> [2006-12-19 09:51]: > I do not have a clue about memory management at all, but is it > possible that you're testing on a box with too much memory? My box has > only 256 MB, and I used to use mutt with a _huge_ inbox with mutt > taking somewhat 150 MB. Add spamassassin and a reasonably busy mail > server, and the box used to be like 150 MB in swap. FWIW, the ARM box I see this on has only 32 MB memory (and a 133 or 266 MHz CPU). I don't see it on another ARM box (different ARM sub-arch) with 128 MB memory and a 600 MHz CPU. -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 8:51 ` Marc Haber 2006-12-19 9:28 ` Martin Michlmayr @ 2006-12-28 18:05 ` Marc Haber 2006-12-28 19:00 ` Linus Torvalds 1 sibling, 1 reply; 311+ messages in thread From: Marc Haber @ 2006-12-28 18:05 UTC (permalink / raw) To: Andrew Morton, Nick Piggin, Linus Torvalds, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Martin Michlmayr On Tue, Dec 19, 2006 at 09:51:49AM +0100, Marc Haber wrote: > On Sun, Dec 17, 2006 at 09:43:08PM -0800, Andrew Morton wrote: > > Six hours here of fsx-linux plus high memory pressure on SMP on 1k > > blocksize ext3, mainline. Zero failures. It's unlikely that this testing > > would pass, yet people running normal workloads are able to easily trigger > > failures. I suspect we're looking in the wrong place. > > I do not have a clue about memory management at all, but is it > possible that you're testing on a box with too much memory? My box has > only 256 MB, and I used to use mutt with a _huge_ inbox with mutt > taking somewhat 150 MB. Add spamassassin and a reasonably busy mail > server, and the box used to be like 150 MB in swap. > > I have tidied my inbox in the mean time and mutt's memory requirement > has been reduced to somewhat 30 MB, which might be the cause that I > don't see the issue that often any more. After being up for ten days, I have now encountered the file corruption of pkgcache.bin for the first time again. The 256 MB i386 box is like 26M in swap, is under very moderate load. I am running plain vanilla 2.6.19.1. Is there a patch that I should apply against 2.6.19.1 that would help in debugging? Greetings Marc -- ----------------------------------------------------------------------------- Marc Haber | "I don't trust Computers. They | Mailadresse im Header Mannheim, Germany | lose things." 
Winona Ryder | Fon: *49 621 72739834 Nordisch by Nature | How to make an American Quilt | Fax: *49 621 72739835 ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-28 18:05 ` Marc Haber @ 2006-12-28 19:00 ` Linus Torvalds 2006-12-28 19:05 ` Petri Kaukasoina ` (2 more replies) 0 siblings, 3 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-28 19:00 UTC (permalink / raw) To: Marc Haber Cc: Andrew Morton, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Martin Michlmayr On Thu, 28 Dec 2006, Marc Haber wrote: > > After being up for ten days, I have now encountered the file > corruption of pkgcache.bin for the first time again. The 256 MB i386 > box is like 26M in swap, is under very moderate load. > > I am running plain vanilla 2.6.19.1. Is there a patch that I should > apply against 2.6.19.1 that would help in debugging? Not right now. And I have a test-program that shows the corruption _much_ easier (at least according to my own testing, and that of several reporters that back me up), and that seems to show the corruption going way way back (ie going back to Linux-2.6.5 at least, according to one tester). So it just got a lot _easier_ to trigger in 2.6.19, but it's not a new bug. What we need now is actually looking at the source code, and people who understand the VM, I'm afraid. I'm gathering traces now that I have a good test-case. I'll post my trace tools once I've tested that they work, in case others want to help. (And hey, you don't have to be a VM expert to help: this could be a learning experience. However, I'll warn you: this is _the_ most grotty part of the whole kernel. It's not even ugly, it's just damn hard and complex). Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-28 19:00 ` Linus Torvalds @ 2006-12-28 19:05 ` Petri Kaukasoina 2006-12-28 19:21 ` Linus Torvalds 2006-12-28 21:24 ` Linus Torvalds 2006-12-29 17:49 ` Guillaume Chazarain 2 siblings, 1 reply; 311+ messages in thread From: Petri Kaukasoina @ 2006-12-28 19:05 UTC (permalink / raw) To: Linus Torvalds Cc: Marc Haber, Andrew Morton, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Martin Michlmayr On Thu, Dec 28, 2006 at 11:00:46AM -0800, Linus Torvalds wrote: > And I have a test-program that shows the corruption _much_ easier (at > least according to my own testing, and that of several reporters that back > me up), and that seems to show the corruption going way way back (ie going > back to Linux-2.6.5 at least, according to one tester). That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18 (or older)? ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-28 19:05 ` Petri Kaukasoina @ 2006-12-28 19:21 ` Linus Torvalds 2006-12-28 19:39 ` Dave Jones 0 siblings, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-28 19:21 UTC (permalink / raw) To: Petri Kaukasoina Cc: Marc Haber, Andrew Morton, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Martin Michlmayr On Thu, 28 Dec 2006, Petri Kaukasoina wrote: > > me up), and that seems to show the corruption going way way back (ie going > > back to Linux-2.6.5 at least, according to one tester). > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18 > (or older)? Well, that was a really _old_ fedora kernel. I guarantee you it didn't have the page throttling patches in it, those were written this summer. So it would either have to be Fedora carrying around another patch that just happens to result in the same corruption for _years_, or it's the same bug. I bet it's the same bug, and it's been around for ages. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-28 19:21 ` Linus Torvalds @ 2006-12-28 19:39 ` Dave Jones 2006-12-28 20:10 ` Arjan van de Ven 2006-12-29 9:23 ` maximilian attems 0 siblings, 2 replies; 311+ messages in thread From: Dave Jones @ 2006-12-28 19:39 UTC (permalink / raw) To: Linus Torvalds Cc: Petri Kaukasoina, Marc Haber, Andrew Morton, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Martin Michlmayr On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote: > > > On Thu, 28 Dec 2006, Petri Kaukasoina wrote: > > > me up), and that seems to show the corruption going way way back (ie going > > > back to Linux-2.6.5 at least, according to one tester). > > > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18 > > (or older)? > > Well, that was a really _old_ fedora kernel. I guarantee you it didn't > have the page throttling patches in it, those were written this summer. So > it would either have to be Fedora carrying around another patch that just > happens to result in the same corruption for _years_, or it's the same > bug. The only notable VM patch in Fedora kernels of that vintage that I recall was Ingo's 4g/4g thing. Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-28 19:39 ` Dave Jones @ 2006-12-28 20:10 ` Arjan van de Ven 2006-12-29 9:23 ` maximilian attems 1 sibling, 0 replies; 311+ messages in thread From: Arjan van de Ven @ 2006-12-28 20:10 UTC (permalink / raw) To: Dave Jones Cc: Linus Torvalds, Petri Kaukasoina, Marc Haber, Andrew Morton, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Martin Michlmayr On Thu, 2006-12-28 at 14:39 -0500, Dave Jones wrote: > On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote: > > > > > > On Thu, 28 Dec 2006, Petri Kaukasoina wrote: > > > > me up), and that seems to show the corruption going way way back (ie going > > > > back to Linux-2.6.5 at least, according to one tester). > > > > > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18 > > > (or older)? > > > > Well, that was a really _old_ fedora kernel. I guarantee you it didn't > > have the page throttling patches in it, those were written this summer. So > > it would either have to be Fedora carrying around another patch that just > > happens to result in the same corruption for _years_, or it's the same > > bug. > > The only notable VM patch in Fedora kernels of that vintage that I recall > was Ingo's 4g/4g thing. which does tlb flushes *all the time* so that even rules out (well almost) a stale tlb somewhere... ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-28 19:39 ` Dave Jones 2006-12-28 20:10 ` Arjan van de Ven @ 2006-12-29 9:23 ` maximilian attems 2006-12-29 15:02 ` Dave Jones 1 sibling, 1 reply; 311+ messages in thread From: maximilian attems @ 2006-12-29 9:23 UTC (permalink / raw) To: davej; +Cc: linux-kernel > On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote: > > > > > > On Thu, 28 Dec 2006, Petri Kaukasoina wrote: > > > > me up), and that seems to show the corruption going way way back (ie going > > > > back to Linux-2.6.5 at least, according to one tester). > > > > > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18 > > > (or older)? > > > > Well, that was a really _old_ fedora kernel. I guarantee you it didn't > > have the page throttling patches in it, those were written this summer. So > > it would either have to be Fedora carrying around another patch that just > > happens to result in the same corruption for _years_, or it's the same > > bug. > > The only notable VM patch in Fedora kernels of that vintage that I recall > was Ingo's 4g/4g thing. > > Dave no the fedora 2.6.18 kernel is affected. it carries the same -mm patches that Debian backported for LSB 3.1 compliance. -- maks ps sorry for stripping cc, only downloaded that message raw. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-29 9:23 ` maximilian attems @ 2006-12-29 15:02 ` Dave Jones 2006-12-29 18:52 ` maximilian attems 0 siblings, 1 reply; 311+ messages in thread From: Dave Jones @ 2006-12-29 15:02 UTC (permalink / raw) To: maximilian attems; +Cc: linux-kernel On Fri, Dec 29, 2006 at 10:23:14AM +0100, maximilian attems wrote: > > On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote: > > > > > > > > > On Thu, 28 Dec 2006, Petri Kaukasoina wrote: > > > > > me up), and that seems to show the corruption going way way back (ie going > > > > > back to Linux-2.6.5 at least, according to one tester). > > > > > > > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18 > > > > (or older)? > > > > > > Well, that was a really _old_ fedora kernel. I guarantee you it didn't > > > have the page throttling patches in it, those were written this summer. So > > > it would either have to be Fedora carrying around another patch that just > > > happens to result in the same corruption for _years_, or it's the same > > > bug. > > > > The only notable VM patch in Fedora kernels of that vintage that I recall > > was Ingo's 4g/4g thing. > > no the fedora 2.6.18 kernel is affected. I wasn't denying that, but Linus was talking about a 2.6.5 Fedora kernel. > it carries the same -mm patches that Debian backported > for LSB 3.1 compliance. The only -mm stuff I recall being in the Fedora 2.6.18 is the inode-diet stuff which ended up in 2.6.19, though the xmas break has left my head somewhat empty so I may be forgetting something. What patch in particular are you talking about? Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-29 15:02 ` Dave Jones @ 2006-12-29 18:52 ` maximilian attems 2006-12-29 19:14 ` Dave Jones 0 siblings, 1 reply; 311+ messages in thread From: maximilian attems @ 2006-12-29 18:52 UTC (permalink / raw) To: Dave Jones, linux-kernel On Fri, Dec 29, 2006 at 10:02:53AM -0500, Dave Jones wrote: > On Fri, Dec 29, 2006 at 10:23:14AM +0100, maximilian attems wrote: > > > On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote: <snip> > > > > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18 > > > > > (or older)? > > > > > > > > Well, that was a really _old_ fedora kernel. I guarantee you it didn't > > > > have the page throttling patches in it, those were written this summer. So > > > > it would either have to be Fedora carrying around another patch that just > > > > happens to result in the same corruption for _years_, or it's the same > > > > bug. > > > > > > The only notable VM patch in Fedora kernels of that vintage that I recall > > > was Ingo's 4g/4g thing. > > > > no the fedora 2.6.18 kernel is affected. > > I wasn't denying that, but Linus was talking about a 2.6.5 Fedora kernel. > > > it carries the same -mm patches that Debian backported > > for LSB 3.1 compliance. > > The only -mm stuff I recall being in the Fedora 2.6.18 is > the inode-diet stuff which ended up in 2.6.19, though the xmas > break has left my head somewhat empty so I may be forgetting something. > What patch in particular are you talking about? it's no longer visible in the FC6 cvs, due to rebase but its name was linux-2.6-mm-tracking-dirty-pages.patch it is an earlier amalgam of the merged patch series: - mm: tracking shared dirty pages - mm: balance dirty pages - mm: optimize the new mprotect() code a bit - mm: small cleanup of install_page() - mm: fixup do_wp_page() - mm: msync() cleanup (closes: #394392) -- maks ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-29 18:52 ` maximilian attems @ 2006-12-29 19:14 ` Dave Jones 0 siblings, 0 replies; 311+ messages in thread From: Dave Jones @ 2006-12-29 19:14 UTC (permalink / raw) To: maximilian attems; +Cc: linux-kernel On Fri, Dec 29, 2006 at 07:52:15PM +0100, maximilian attems wrote: > > The only -mm stuff I recall being in the Fedora 2.6.18 is > > the inode-diet stuff which ended up in 2.6.19, though the xmas > > break has left my head somewhat empty so I may be forgetting something. > > What patch in particular are you talking about? > > it's no longer visible in the FC6 cvs, due to rebase > but its name was linux-2.6-mm-tracking-dirty-pages.patch > it is an earlier amalgam of the merged patch series: > - mm: tracking shared dirty pages > - mm: balance dirty pages > - mm: optimize the new mprotect() code a bit > - mm: small cleanup of install_page() > - mm: fixup do_wp_page() > - mm: msync() cleanup (closes: #394392) Ohh, that. Yes. I had forgotten all about that. I've been hitting the nog a little too hard :) Dave -- http://www.codemonkey.org.uk ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-28 19:00 ` Linus Torvalds 2006-12-28 19:05 ` Petri Kaukasoina @ 2006-12-28 21:24 ` Linus Torvalds 2006-12-28 21:36 ` Russell King 2006-12-28 22:37 ` Linus Torvalds 2006-12-29 17:49 ` Guillaume Chazarain 2 siblings, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-28 21:24 UTC (permalink / raw) To: Marc Haber Cc: Andrew Morton, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Martin Michlmayr On Thu, 28 Dec 2006, Linus Torvalds wrote: > > What we need now is actually looking at the source code, and people who > understand the VM, I'm afraid. I'm gathering traces now that I have a good > test-case. I'll post my trace tools once I've tested that they work, in > case others want to help. Ok, I've got the traces, but quite frankly, I doubt anybody is crazy enough to want to trawl through them. It's a bit painful, since we're talking thousands of pages to trigger this problem. Also, I've used the PG_arch_1 flag, which is fine on x86[-64] and probably ARM, but is used for other things on ia64, powerpc and sparc64. But here's the patch in case anybody cares. It wants a _big_ kernel buffer to capture all the crud into (which is why I made the thing accept a bigger log buffer), and quite frankly, I'm not at all sure that all the locking is ok (ie I could imagine that the dcache-locking thing there in "is_interesting()" could deadlock, what do I know..) But I've captured some real data with this, which I'll describe separately. Linus
----
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 350878a..967dd80 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -91,6 +91,8 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
+#define SetPageInteresting(page) set_bit(PG_arch_1, &(page)->flags)
+#define PageInteresting(page) test_bit(PG_arch_1, &(page)->flags)
 
 #if (BITS_PER_LONG > 32)
 /*
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 5c26818..7735b83 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -79,7 +79,7 @@ config DEBUG_KERNEL
 
 config LOG_BUF_SHIFT
 	int "Kernel log buffer size (16 => 64KB, 17 => 128KB)" if DEBUG_KERNEL
-	range 12 21
+	range 12 24
 	default 17 if S390 || LOCKDEP
 	default 16 if X86_NUMAQ || IA64
 	default 15 if SMP
diff --git a/mm/filemap.c b/mm/filemap.c
index 8332c77..d6a0f56 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -116,6 +116,7 @@ void __remove_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
 
+if (PageInteresting(page)) printk("Removing index %08x from page cache\n", page->index);
 	radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	mapping->nrpages--;
@@ -421,6 +422,23 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 	return err;
 }
 
+static noinline int is_interesting(struct address_space *mapping)
+{
+	struct inode *inode = mapping->host;
+	struct dentry *dentry;
+	int retval = 0;
+
+	spin_lock(&dcache_lock);
+	list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
+		if (strcmp(dentry->d_name.name, "mapfile"))
+			continue;
+		retval = 1;
+		break;
+	}
+	spin_unlock(&dcache_lock);
+	return retval;
+}
+
 /**
  * add_to_page_cache - add newly allocated pagecache pages
  * @page: page to add
@@ -439,6 +457,9 @@ int add_to_page_cache(struct page *page, struct address_space *mapping,
 {
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
+	if (is_interesting(mapping))
+		SetPageInteresting(page);
+
 	if (error == 0) {
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..14c9815 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -667,6 +667,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			tlb_remove_tlb_entry(tlb, pte, addr);
 			if (unlikely(!page))
 				continue;
+if (PageInteresting(page))
+	printk("Unmapped index %08x at %08x\n", page->index, addr);
 			if (unlikely(details) && details->nonlinear_vma
 			    && linear_page_index(details->nonlinear_vma,
 						addr) != page->index)
@@ -1605,6 +1607,7 @@ gotten:
 		 */
 		ptep_clear_flush(vma, address, page_table);
 		set_pte_at(mm, address, page_table, entry);
+if (PageInteresting(new_page)) printk("do_wp_page: mapping index %08x at %08lx\n", new_page->index, address);
 		update_mmu_cache(vma, address, entry);
 		lru_cache_add_active(new_page);
 		page_add_new_anon_rmap(new_page, vma, address);
@@ -2249,6 +2252,7 @@ retry:
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		if (write_access)
 			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+if (PageInteresting(new_page)) printk("do_no_page: mapping index %08x at %08lx (%s)\n", new_page->index, address, write_access ? "write" : "read");
 		set_pte_at(mm, address, page_table, entry);
 		if (anon) {
 			inc_mm_counter(mm, anon_rss);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3a198c..0466601 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -813,6 +813,7 @@ int fastcall set_page_dirty(struct page *page)
 		if (!spd)
 			spd = __set_page_dirty_buffers;
 #endif
+if (PageInteresting(page)) printk("Setting page %08x dirty\n", page->index);
 		return (*spd)(page);
 	}
 	if (!PageDirty(page)) {
@@ -867,6 +868,7 @@ int clear_page_dirty_for_io(struct page *page)
 
 	if (TestClearPageDirty(page)) {
 		if (mapping_cap_account_dirty(mapping)) {
+if (PageInteresting(page)) printk("cpd_for_io: index %08x\n", page->index);
 			page_mkclean(page);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 		}
diff --git a/mm/rmap.c b/mm/rmap.c
index 57306fa..e98e84c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,6 +448,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 	if (pte_dirty(*pte) || pte_write(*pte)) {
 		pte_t entry;
 
+if (PageInteresting(page)) printk("cleaning index %08x at %08x\n", page->index, address);
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		entry = ptep_clear_flush(vma, address, pte);
 		entry = pte_wrprotect(entry);
@@ -637,6 +638,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		goto out_unmap;
 	}
 
+if (PageInteresting(page)) printk("unmapping index %08x from %08lx\n", page->index, address);
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
 	pteval = ptep_clear_flush(vma, address, pte);
@@ -767,6 +769,7 @@ static void try_to_unmap_cluster(unsigned long cursor,
 		if (ptep_clear_flush_young(vma, address, pte))
 			continue;
 
+if (PageInteresting(page)) printk("Cluster-unmapping %08x from %08lx\n", page->index, address);
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
 		pteval = ptep_clear_flush(vma, address, pte);
^ permalink raw reply related [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-28 21:24 ` Linus Torvalds @ 2006-12-28 21:36 ` Russell King 2006-12-28 22:37 ` Linus Torvalds 1 sibling, 0 replies; 311+ messages in thread From: Russell King @ 2006-12-28 21:36 UTC (permalink / raw) To: Linus Torvalds Cc: Marc Haber, Andrew Morton, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Martin Michlmayr On Thu, Dec 28, 2006 at 01:24:30PM -0800, Linus Torvalds wrote: > On Thu, 28 Dec 2006, Linus Torvalds wrote: > > > > What we need now is actually looking at the source code, and people who > > understand the VM, I'm afraid. I'm gathering traces now that I have a good > > test-case. I'll post my trace tools once I've tested that they work, in > > case others want to help. > > Ok, I've got the traces, but quite frankly, I doubt anybody is crazy > enough to want to trawl through them. It's a bit painful, since we're > talking thousands of pages to trigger this problem. > > Also, I've used the PG_arch_1 flag, which is fine on x86[-64] and probably > ARM, but is used for other things on ia64, powerpc and sparc64. But here's > the patch in case anybody cares. PG_arch_1 is used on ARM to flag pages that need a dcache flush prior to hitting userspace, in the same way that sparc64 uses it. So ARM systems should not have this patch applied. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-28 21:24 ` Linus Torvalds 2006-12-28 21:36 ` Russell King @ 2006-12-28 22:37 ` Linus Torvalds 2006-12-28 22:50 ` David Miller 2006-12-28 23:36 ` Anton Altaparmakov 1 sibling, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-28 22:37 UTC (permalink / raw) To: Andrew Morton Cc: Guillaume Chazarain, David Miller, ranma, gordonfarquharson, Marc Haber, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Martin Michlmayr, arjan, Chen Kenneth W

Ok, with the ugly trace capture patch, I've actually captured this corruption in action, I think. I did a full trace of all pages involved in one run, and picked one corruption at random:

	Chunk 14465 corrupted (0-75) (01423fb4-01423fff)
	Expected 129, got 0
	Written as (5126)9509(15017)

That's the first 76 bytes of a chunk missing, and it's the last 76 bytes on a page. It's page index 01423 in the mapped file, and bytes fb4-fff within that page. There were four chunks written to that page:

	Writing chunk 14463/15800 (15%) (0142344c) (1)
	Writing chunk 14462/15800 (30%) (01422e98) (2) (overflows into 00001423)
	Writing chunk 14464/15800 (32%) (01423a00) (3)
	Writing chunk 14465/15800 (60%) (01423fb4) (4) <--- LOST!

and the other three chunks checked out all right.

And here's the annotated trace as it concerns that page:

 - here we write the first chunk to the page:
	** (1) do_no_page: mapping index 00001423 at b7d1f44c (write)
	** Setting page 00001423 dirty
 - something flushes it out to disk:
	** cpd_for_io: index 00001423
	** cleaning index 00001423 at b7d1f000
 - here we write the second chunk (which was split over the previous page and the interesting one):
	** (2) Setting page 00001422 dirty
	** (2) Setting page 00001423 dirty
 - and here we do a cleaning event
	** cpd_for_io: index 00001423
	** cleaning index 00001423 at b7d1f000
 - here we write the third chunk:
	** (3) Setting page 00001423 dirty
 - here we write the fourth chunk:
	** (4) NO DIRTY EVENT
 - and a third flush to disk:
	** cpd_for_io: index 00001423
	** cleaning index 00001423 at b7d1f000
 - here we unmap and flush:
	** Unmapped index 00001423 at b7d1f000
	** Removing index 00001423 from page cache
 - here we remap to check:
	** do_no_page: mapping index 00001423 at b7d1f000 (read)
	** Unmapped index 00001423 at b7d1f000
 - and finally, here I remove the file after the run:
	** Removing index 00001423 from page cache

Now, the important thing to see here is:

 - the missing write did not have a "Setting page 00001423 dirty" event associated with it.

 - but I can _see_ where the actual dirty event would be happening in the logs, because I can see the dirty events of the other chunk writes around it, so I know exactly where that fourth write happens. And indeed, it _shouldn't_ get a dirty event, because the page is still dirty from the write of chunk #3 to that page, which _did_ get a dirty event. I can see that, because the testing app writes the log of the pages it writes, and this is the log around the fourth and final write:

	...
	Writing chunk 5338/15800 (60%) (0076eb48) PFN: 76e/76f
	Writing chunk 960/15800 (60%) (00156300) PFN: 156
	Writing chunk 14465/15800 (60%) (01423fb4) <----
	Writing chunk 8594/15800 (60%) (00bf74a8) PFN: bf7
	Writing chunk 556/15800 (60%) (000c62f0) PFN: c6
	Writing chunk 15190/15800 (60%) (01526678) PFN: 1526
	...

   and I can match this up with the full log from the kernel, which looks like this:

	Setting page 0000076e dirty
	Setting page 0000076f dirty
	Setting page 00000156 dirty
	Setting page 000000c6 dirty
	Setting page 00001526 dirty

   so I know exactly where the missing writes (to our page at pfn 1423, and the pfn-bf7 page) happened.

 - and the thing is, I can see a "cpd_for_io()" happening AFTER that fourth write. Quite a long while after, in fact. So all of this looks very fine indeed. We are not losing any dirty bits.

 - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses the SAME dirty bit as write 4 did (which didn't make it out to disk!). The event that clears the dirty bit that write 3 did happens AFTER write 4 has happened!

So if we're not losing any dirty bits, what's going on?

I think we have some nasty interaction with the buffer heads. In particular, I don't think it's the dirty page bits that are broken (I _see_ that the PageDirty bit was set after write 4 was done to memory in the kernel traces). So I think that a real writeback just doesn't happen, because somebody has marked the buffer heads clean _after_ it started IO on them.

I think "__mpage_writepage()" is buggy in this regard, for example. It even has a comment about its crapola behaviour:

	/*
	 * Must try to add the page before marking the buffer clean or
	 * the confused fail path above (OOM) will be very confused when
	 * it finds all bh marked clean (i.e. it will not write anything)
	 */

however, I don't think that particular thing explains it, because I don't think we use that function for the cases I'm looking at.

Anyway, I'll add tracing for page-writeback setting/cleaning too, in case I might see anything new there..

Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-28 22:37 ` Linus Torvalds @ 2006-12-28 22:50 ` David Miller 2006-12-28 23:01 ` Linus Torvalds 2006-12-29 1:38 ` Linus Torvalds 2006-12-28 23:36 ` Anton Altaparmakov 1 sibling, 2 replies; 311+ messages in thread From: David Miller @ 2006-12-28 22:50 UTC (permalink / raw) To: torvalds Cc: akpm, guichaz, ranma, gordonfarquharson, mh+linux-kernel, nickpiggin, andrei.popa, linux-kernel, a.p.zijlstra, hugh, fw, tbm, arjan, kenneth.w.chen From: Linus Torvalds <torvalds@osdl.org> Date: Thu, 28 Dec 2006 14:37:37 -0800 (PST) > So if we're not losing any dirty bits, what's going on? What happens when we writeback, to the PTEs? page_mkclean_file() iterates the VMAs and when it finds a shared one it goes: entry = ptep_clear_flush(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); and that's fine, but that PTE is still marked writable, and I think that's key. What does the fault path do in this situation? if (write_access) { if (!pte_write(entry)) return do_wp_page(mm, vma, address, pte, pmd, ptl, entry); entry = pte_mkdirty(entry); } It does nothing to update the page dirty state, because it's writable, it just sets the PTE dirty bit and that's it. Should it be setting the page dirty here for SHARED cases? So until vmscan actually unmaps the PTE completely, we have this window in which the application can write to the PTE and the page dirty state doesn't get updated. Perhaps something later cleans up after this, f.e. by rechecking the PTE dirty bit at the end of I/O or when vmscan unmaps the page. I guess that should handle things, but the above logic definitely stood out to me. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-28 22:50 ` David Miller @ 2006-12-28 23:01 ` Linus Torvalds 2006-12-29 1:38 ` Linus Torvalds 1 sibling, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-28 23:01 UTC (permalink / raw) To: David Miller Cc: akpm, guichaz, ranma, gordonfarquharson, mh+linux-kernel, nickpiggin, andrei.popa, linux-kernel, a.p.zijlstra, hugh, fw, tbm, arjan, kenneth.w.chen On Thu, 28 Dec 2006, David Miller wrote: > > What happens when we writeback, to the PTEs? Not a damn thing. We clear the PTE's _before_ we even start the write. The writeback does nothing to them. If the user dirties the page while writeback is in progress, we'll take the page fault and re-dirty it _again_. > page_mkclean_file() iterates the VMAs and when it finds a shared > one it goes: > > entry = ptep_clear_flush(vma, address, pte); > entry = pte_wrprotect(entry); > entry = pte_mkclean(entry); > > and that's fine, but that PTE is still marked writable, and > I think that's key. No it's not. It's right there. "pte_wrprotect(entry)". You even copied it yourself. > What does the fault path do in this situation? > > if (write_access) { > if (!pte_write(entry)) > return do_wp_page(mm, vma, address, > pte, pmd, ptl, entry); So we call "do_wp_page()", and that does everything right. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
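The sequence described in the message above (write-protect and clean the PTE before IO, then take a write fault on the next store) can be modelled in a few lines of user-space C. This is only a toy sketch, not kernel code: the flag values and every `toy_` name are made up for illustration, and a real PTE layout is architecture-specific.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for a toy PTE (assumption, not a real layout). */
#define TOY_PTE_WRITE 0x1u
#define TOY_PTE_DIRTY 0x2u

struct toy_page {
	unsigned int pte;	/* the single toy PTE mapping this page */
	bool page_dirty;	/* stands in for the PageDirty bit */
	int wp_faults;		/* number of write-protect faults taken */
};

/* Stand-in for the page_mkclean_one() step: the PTE is both
 * write-protected and cleaned before the IO is started. */
static void toy_mkclean_for_io(struct toy_page *pg)
{
	assert(pg != NULL);
	pg->page_dirty = false;
	pg->pte &= ~(TOY_PTE_WRITE | TOY_PTE_DIRTY);
}

/* Stand-in for a user store through the shared mapping. */
static void toy_store(struct toy_page *pg)
{
	assert(pg != NULL);
	if (!(pg->pte & TOY_PTE_WRITE)) {
		/* Write-protect fault: the do_wp_page() analogue
		 * re-dirties the page and makes the PTE writable. */
		pg->wp_faults++;
		pg->page_dirty = true;
		pg->pte |= TOY_PTE_WRITE;
	}
	/* With a writable PTE the hardware just sets the dirty bit. */
	pg->pte |= TOY_PTE_DIRTY;
}
```

In this model the first store after `toy_mkclean_for_io()` faults and re-dirties the page, while a second store to the same page takes no fault at all. That is consistent with the "NO DIRTY EVENT" entry for write 4 in the trace: the page was still dirty from write 3.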
* Re: 2.6.19 file content corruption on ext3 2006-12-28 22:50 ` David Miller 2006-12-28 23:01 ` Linus Torvalds @ 2006-12-29 1:38 ` Linus Torvalds 2006-12-29 1:59 ` Andrew Morton 1 sibling, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-29 1:38 UTC (permalink / raw) To: David Miller Cc: akpm, guichaz, ranma, gordonfarquharson, mh+linux-kernel, nickpiggin, andrei.popa, linux-kernel, a.p.zijlstra, hugh, fw, tbm, arjan, kenneth.w.chen [-- Attachment #1: Type: TEXT/PLAIN, Size: 9586 bytes --] Btw, much cleaned-up page tracing patch here, in case anybody cares (and "test.c" attached, although I don't think it changed since last time). The test.c output is a bit hard to read at times, since it will give offsets in bytes as hex (ie "00a77664" means page frame 00000a77, and byte 664h within that page), while the kernel output is obviously the page indexes (but the page fault _addresses_ can contain information about the exact byte in a page, so you can match them up when some kernel event is related to a page fault). So both forms are necessary/logical, but it means that to match things up, you often need to ignore the last three hex digits of the address that "test.c" outputs. This one also adds traces for the tags and the writeback activity, but since I'm going out for birthday dinner, I won't have time to try to actually analyse the trace I have.. Which is why I'm sending it out, in the hope that somebody else is working on this corruption issue and is interested.. 
Linus ---- diff --git a/fs/buffer.c b/fs/buffer.c index 263f88e..f5e132a 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -722,6 +722,7 @@ int __set_page_dirty_buffers(struct page *page) set_buffer_dirty(bh); bh = bh->b_this_page; } while (bh != head); + PAGE_TRACE(page, "dirtied buffers"); } spin_unlock(&mapping->private_lock); @@ -734,6 +735,7 @@ int __set_page_dirty_buffers(struct page *page) __inc_zone_page_state(page, NR_FILE_DIRTY); task_io_account_write(PAGE_CACHE_SIZE); } + PAGE_TRACE(page, "setting TAG_DIRTY"); radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 350878a..0cf3dce 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -91,6 +91,14 @@ #define PG_nosave_free 18 /* Used for system suspend/resume */ #define PG_buddy 19 /* Page is free, on buddy lists */ +#define SetPageInteresting(page) set_bit(PG_arch_1, &(page)->flags) +#define PageInteresting(page) test_bit(PG_arch_1, &(page)->flags) + +#define PAGE_TRACE(page, msg, arg...) 
do { \ + if (PageInteresting(page)) \ + printk(KERN_DEBUG "PG %08lx: %s:%d " msg "\n", \ + (page)->index, __FILE__, __LINE__ ,##arg ); \ +} while (0) #if (BITS_PER_LONG > 32) /* @@ -183,32 +191,38 @@ static inline void SetPageUptodate(struct page *page) #define PageWriteback(page) test_bit(PG_writeback, &(page)->flags) #define SetPageWriteback(page) \ do { \ - if (!test_and_set_bit(PG_writeback, \ - &(page)->flags)) \ + if (!test_and_set_bit(PG_writeback, &(page)->flags)) { \ + PAGE_TRACE(page, "set writeback"); \ inc_zone_page_state(page, NR_WRITEBACK); \ + } \ } while (0) #define TestSetPageWriteback(page) \ ({ \ int ret; \ ret = test_and_set_bit(PG_writeback, \ &(page)->flags); \ - if (!ret) \ + if (!ret) { \ + PAGE_TRACE(page, "set writeback"); \ inc_zone_page_state(page, NR_WRITEBACK); \ + } \ ret; \ }) #define ClearPageWriteback(page) \ do { \ - if (test_and_clear_bit(PG_writeback, \ - &(page)->flags)) \ + if (test_and_clear_bit(PG_writeback, &(page)->flags)) { \ + PAGE_TRACE(page, "end writeback"); \ dec_zone_page_state(page, NR_WRITEBACK); \ + } \ } while (0) #define TestClearPageWriteback(page) \ ({ \ int ret; \ ret = test_and_clear_bit(PG_writeback, \ &(page)->flags); \ - if (ret) \ + if (ret) { \ + PAGE_TRACE(page, "end writeback"); \ dec_zone_page_state(page, NR_WRITEBACK); \ + } \ ret; \ }) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 5c26818..7735b83 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -79,7 +79,7 @@ config DEBUG_KERNEL config LOG_BUF_SHIFT int "Kernel log buffer size (16 => 64KB, 17 => 128KB)" if DEBUG_KERNEL - range 12 21 + range 12 24 default 17 if S390 || LOCKDEP default 16 if X86_NUMAQ || IA64 default 15 if SMP diff --git a/mm/filemap.c b/mm/filemap.c index 8332c77..4df7d35 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -116,6 +116,7 @@ void __remove_from_page_cache(struct page *page) { struct address_space *mapping = page->mapping; + PAGE_TRACE(page, "Removing page cache"); 
radix_tree_delete(&mapping->page_tree, page->index); page->mapping = NULL; mapping->nrpages--; @@ -421,6 +422,23 @@ int filemap_write_and_wait_range(struct address_space *mapping, return err; } +static noinline int is_interesting(struct address_space *mapping) +{ + struct inode *inode = mapping->host; + struct dentry *dentry; + int retval = 0; + + spin_lock(&dcache_lock); + list_for_each_entry(dentry, &inode->i_dentry, d_alias) { + if (strcmp(dentry->d_name.name, "mapfile")) + continue; + retval = 1; + break; + } + spin_unlock(&dcache_lock); + return retval; +} + /** * add_to_page_cache - add newly allocated pagecache pages * @page: page to add @@ -439,6 +457,9 @@ int add_to_page_cache(struct page *page, struct address_space *mapping, { int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM); + if (is_interesting(mapping)) + SetPageInteresting(page); + if (error == 0) { write_lock_irq(&mapping->tree_lock); error = radix_tree_insert(&mapping->page_tree, offset, page); diff --git a/mm/memory.c b/mm/memory.c index 563792f..20af32f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -667,6 +667,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, tlb_remove_tlb_entry(tlb, pte, addr); if (unlikely(!page)) continue; + PAGE_TRACE(page, "unmapped at %08lx", addr); if (unlikely(details) && details->nonlinear_vma && linear_page_index(details->nonlinear_vma, addr) != page->index) @@ -1605,6 +1606,7 @@ gotten: */ ptep_clear_flush(vma, address, page_table); set_pte_at(mm, address, page_table, entry); + PAGE_TRACE(new_page, "write fault at %08lx", address); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); page_add_new_anon_rmap(new_page, vma, address); @@ -2249,6 +2251,7 @@ retry: entry = mk_pte(new_page, vma->vm_page_prot); if (write_access) entry = maybe_mkwrite(pte_mkdirty(entry), vma); + PAGE_TRACE(new_page, "mapping at %08lx (%s)", address, write_access ? 
"write" : "read"); set_pte_at(mm, address, page_table, entry); if (anon) { inc_mm_counter(mm, anon_rss); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index b3a198c..15f3aaf 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -773,6 +773,7 @@ int __set_page_dirty_nobuffers(struct page *page) __inc_zone_page_state(page, NR_FILE_DIRTY); task_io_account_write(PAGE_CACHE_SIZE); } + PAGE_TRACE(page, "setting TAG_DIRTY"); radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } @@ -813,6 +814,7 @@ int fastcall set_page_dirty(struct page *page) if (!spd) spd = __set_page_dirty_buffers; #endif + PAGE_TRACE(page, "setting dirty"); return (*spd)(page); } if (!PageDirty(page)) { @@ -867,6 +869,7 @@ int clear_page_dirty_for_io(struct page *page) if (TestClearPageDirty(page)) { if (mapping_cap_account_dirty(mapping)) { + PAGE_TRACE(page, "clean_for_io"); page_mkclean(page); dec_zone_page_state(page, NR_FILE_DIRTY); } @@ -886,10 +889,12 @@ int test_clear_page_writeback(struct page *page) write_lock_irqsave(&mapping->tree_lock, flags); ret = TestClearPageWriteback(page); - if (ret) + if (ret) { + PAGE_TRACE(page, "clearing TAG_WRITEBACK"); radix_tree_tag_clear(&mapping->page_tree, page_index(page), PAGECACHE_TAG_WRITEBACK); + } write_unlock_irqrestore(&mapping->tree_lock, flags); } else { ret = TestClearPageWriteback(page); @@ -907,14 +912,18 @@ int test_set_page_writeback(struct page *page) write_lock_irqsave(&mapping->tree_lock, flags); ret = TestSetPageWriteback(page); - if (!ret) + if (!ret) { + PAGE_TRACE(page, "setting TAG_WRITEBACK"); radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_WRITEBACK); - if (!PageDirty(page)) + } + if (!PageDirty(page)) { + PAGE_TRACE(page, "clearing TAG_DIRTY"); radix_tree_tag_clear(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); + } write_unlock_irqrestore(&mapping->tree_lock, flags); } else { ret = TestSetPageWriteback(page); diff --git a/mm/rmap.c b/mm/rmap.c index 
57306fa..e6b4676 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -448,6 +448,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) if (pte_dirty(*pte) || pte_write(*pte)) { pte_t entry; + PAGE_TRACE(page, "cleaning PTE %08lx", address); flush_cache_page(vma, address, pte_pfn(*pte)); entry = ptep_clear_flush(vma, address, pte); entry = pte_wrprotect(entry); @@ -637,6 +638,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, goto out_unmap; } + PAGE_TRACE(page, "unmapping from %08lx", address); /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); pteval = ptep_clear_flush(vma, address, pte); @@ -767,6 +769,7 @@ static void try_to_unmap_cluster(unsigned long cursor, if (ptep_clear_flush_young(vma, address, pte)) continue; + PAGE_TRACE(page, "unmapping from %08lx", address); /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); pteval = ptep_clear_flush(vma, address, pte);

[-- Attachment #2: Type: TEXT/PLAIN, Size: 2975 bytes --]

#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (22 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}

#define page_offset(buf, off) (unsigned)((unsigned long)(buf)+(off)-(unsigned long)(mapping))

static int chunkorder[NRCHUNKS];
static char *mapping;

static int order(int nr)
{
	int i;
	if (nr < 0 || nr >= NRCHUNKS)
		return -1;
	for (i = 0; i < NRCHUNKS; i++)
		if (chunkorder[i] == nr)
			return i;
	return -2;
}

static void checkmem(void *buf, int nr)
{
	unsigned int start = ~0u, end = 0;
	unsigned char c = nr, *p = buf, differs = 0;
	int i;
	for (i = 0; i < CHUNKSIZE; i++) {
		unsigned char got = *p++;
		if (got != c) {
			if (i < start)
				start = i;
			if (i > end)
				end = i;
			differs = got;
		}
	}
	if (start < end) {
		printf("Chunk %d corrupted (%u-%u) (%x-%x) \n", nr, start, end,
			page_offset(buf, start), page_offset(buf, end));
		printf("Expected %u, got %u\n", c, differs);
		printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), order(nr+1));
	}
}

static char *remap(int fd, char *mapping)
{
	if (mapping) {
		munmap(mapping, SIZE);
		posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
	}
	return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(int argc, char **argv)
{
	int fd, i;

	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;

	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = remap(fd, NULL);
	if (-1 == (int)(long)mapping)
		return -1;

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%) (%08x) \r", chunk, NRCHUNKS, 100*i/NRCHUNKS,
			page_offset(mapping, chunk * CHUNKSIZE));
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Unmap, drop, and remap.. */
	mapping = remap(fd, mapping);

	/* .. and check */
	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%) (%08x) \r", i, NRCHUNKS, 100*i/NRCHUNKS,
			page_offset(mapping, i * CHUNKSIZE));
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Clean up for next time */
	sleep(5);
	sync();
	sleep(5);
	munmap(mapping, SIZE);
	close(fd);
	unlink("mapfile");
	return 0;
}

^ permalink raw reply related [flat|nested] 311+ messages in thread
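To match the byte offsets that test.c prints against the page indexes in the kernel trace, the "ignore the last three hex digits" rule from the message above can be written out as a small helper. This is a sketch that assumes 4 kB pages (as on the i686 machines in this thread); the `TOY_`/`toy_` names are illustrative and do not appear in test.c.

```c
#include <assert.h>
#include <limits.h>

#define TOY_PAGE_SHIFT	12	/* assumption: 4 kB pages */
#define TOY_PAGE_SIZE	(1UL << TOY_PAGE_SHIFT)
#define TOY_CHUNKSIZE	1460UL	/* same value as CHUNKSIZE in test.c */

/* Byte offset in the mapped file at which a chunk starts. */
static unsigned long toy_chunk_start(unsigned long chunk)
{
	assert(chunk <= ULONG_MAX / TOY_CHUNKSIZE);	/* no overflow */
	return chunk * TOY_CHUNKSIZE;
}

/* Page index as the kernel trace prints it: drop the low 12 bits. */
static unsigned long toy_page_index(unsigned long off)
{
	return off >> TOY_PAGE_SHIFT;
}

/* Byte within the page: the last three hex digits of the offset. */
static unsigned long toy_byte_in_page(unsigned long off)
{
	return off & (TOY_PAGE_SIZE - 1);
}

/* How many bytes of a chunk land on the page where the chunk starts. */
static unsigned long toy_bytes_on_first_page(unsigned long chunk)
{
	unsigned long room = TOY_PAGE_SIZE - toy_byte_in_page(toy_chunk_start(chunk));

	return room < TOY_CHUNKSIZE ? room : TOY_CHUNKSIZE;
}
```

For chunk 14465 from the corruption report this gives a start offset of 01423fb4, page index 1423, and 76 bytes on that page, which lines up with the "(0-75) (01423fb4-01423fff)" range that went missing.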
* Re: 2.6.19 file content corruption on ext3 2006-12-29 1:38 ` Linus Torvalds @ 2006-12-29 1:59 ` Andrew Morton 0 siblings, 0 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-29 1:59 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, guichaz, ranma, gordonfarquharson, mh+linux-kernel, nickpiggin, andrei.popa, linux-kernel, a.p.zijlstra, hugh, fw, tbm, arjan, kenneth.w.chen On Thu, 28 Dec 2006 17:38:38 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > in > the hope that somebody else is working on this corruption issue and is > interested.. What corruption issue? ;) I'm finding that the corruption happens trivially with your test app, but apparently doesn't happen at all with ext2 or ext3, data=writeback. Maybe it will happen with increased rarity, but the difference is quite stark. Removing the err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE, NULL, journal_dirty_data_fn); from ext3_ordered_writepage() fixes things up. The things which journal_submit_data_buffers() does after dropping all the locks are ... disturbing - I don't think we have sufficient tests in there to ensure that the buffer is still where we think it is after we retake locks (they're slippery little buggers). But that wouldn't explain it anyway. It's inefficient that journal_dirty_data() will put these locked, clean buffers onto BJ_SyncData instead of BJ_Locked, but journal_submit_data_buffers() seems to dtrt with them. So no theory yet. Maybe ext3 is just altering timing. But the difference is really large.. Disabling all the WB_SYNC_NONE stuff and making everything go synchronous everywhere has no effect. Disabling bdi_write_congested() has no effect. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-28 22:37 ` Linus Torvalds 2006-12-28 22:50 ` David Miller @ 2006-12-28 23:36 ` Anton Altaparmakov 2006-12-28 23:54 ` Linus Torvalds 1 sibling, 1 reply; 311+ messages in thread From: Anton Altaparmakov @ 2006-12-28 23:36 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Guillaume Chazarain, David Miller, ranma, gordonfarquharson, Marc Haber, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Martin Michlmayr, arjan, Chen Kenneth W On Thu, 28 Dec 2006, Linus Torvalds wrote: > Ok, > with the ugly trace capture patch, I've actually captured this corruption > in action, I think. > > I did a full trace of all pages involved in one run, and picked one > corruption at random: > > Chunk 14465 corrupted (0-75) (01423fb4-01423fff) > Expected 129, got 0 > Written as (5126)9509(15017) > > That's the first 76 bytes of a chunk missing, and it's the last 76 bytes > on a page. It's page index 01423 in the mapped file, and bytes fb4-fff > within that file. > > There were four chunks written to that page: > > Writing chunk 14463/15800 (15%) (0142344c) (1) > Writing chunk 14462/15800 (30%) (01422e98) (2) (overflows into 00001423) > Writing chunk 14464/15800 (32%) (01423a00) (3) > Writing chunk 14465/15800 (60%) (01423fb4) (4) <--- LOST! > > and the other three chunks checked out all right. 
> > And here's the annotated trace as it concerns that page: > > - here we write the first chunk to the page: > ** (1) do_no_page: mapping index 00001423 at b7d1f44c (write) > ** Setting page 00001423 dirty > > - something flushes it out to disk: > ** cpd_for_io: index 00001423 > ** cleaning index 00001423 at b7d1f000 > > - here we write the second chunk (which was split over the previous page > and the interesting one): > ** (2) Setting page 00001422 dirty > ** (2) Setting page 00001423 dirty > > - and here we do a cleaning event > ** cpd_for_io: index 00001423 > ** cleaning index 00001423 at b7d1f000 > > - here we write the third chunk: > ** (3) Setting page 00001423 dirty > > - here we write the fourth chunk: > ** (4) NO DIRTY EVENT > > - and a third flush to disk: > ** cpd_for_io: index 00001423 > ** cleaning index 00001423 at b7d1f000 > > - here we unmap and flush: > ** Unmapped index 00001423 at b7d1f000 > ** Removing index 00001423 from page cache > > - here we remap to check: > ** do_no_page: mapping index 00001423 at b7d1f000 (read) > ** Unmapped index 00001423 at b7d1f000 > > - and finally, here I remove the file after the run: > ** Removing index 00001423 from page cache > > Now, the important thing to see here is: > > - the missing write did not have a "Setting page 00001423 dirty" event > associated with it. > > - but I can _see_ where the actual dirty event would be happening in the > logs, because I can see the dirty events of the other chunk writes > around it, so I know exactly where that fourth write happens. And > indeed, it _shouldn't_ get a dirty event, because the page is still > dirty from the write of chunk #3 to that page, which _did_ get a dirty > event. > > I can see that, because the testing app writes the log of the pages it > writes, and this is the log around the fourth and final write: > > ... 
> Writing chunk 5338/15800 (60%) (0076eb48) PFN: 76e/76f > Writing chunk 960/15800 (60%) (00156300) PFN: 156 > Writing chunk 14465/15800 (60%) (01423fb4) <---- > Writing chunk 8594/15800 (60%) (00bf74a8) PFN: bf7 > Writing chunk 556/15800 (60%) (000c62f0) PFN: c6 > Writing chunk 15190/15800 (60%) (01526678) PFN: 1526 > ... > > and I can match this up with the full log from the kernel, which looks > like this: > > Setting page 0000076e dirty > Setting page 0000076f dirty > Setting page 00000156 dirty > Setting page 000000c6 dirty > Setting page 00001526 dirty > > so I know exactly where the missing writes (to our page at pfn 1423, > and the fpn-bf7 page) happened. > > - and the thing is, I can see a "cpd_for_io()" happening AFTER that > fourth write. Quite a long while after, in fact. So all of this looks > very fine indeed. We are not losing any dirty bits. > > - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses > the SAME dirty bit as write 4 did (which didn't make it out to disk!). > The event that clears the dirty bit that write 3 did happens AFTER > write 4 has happened! > > So if we're not losing any dirty bits, what's going on? > > I think we have some nasty interaction with the buffer heads. In But are chunks 3 and 4 in separate buffer heads? Sorry could not see it immediately from the output you showed... It is just that there may be a different cause rather than buffer dirty state... A shot in the dark I know but it could perhaps be that a "COW for MAP_PRIVATE" like event happens when the page is dirty already thus the second write never actually makes it to the shared page thus it never gets written out. I am almost certainly totally barking up the wrong tree but I thought it may be worth mentioning just in case there was a slip in the COW logic or page writable state maintenance somewhere... 
Best regards, Anton > particular, I don't think it's the dirty page bits that are broken (I > _see_ that the PageDirty bit was set after write 4 was done to memory in > the kernel traces). So I think that a real writeback just doesn't happen, > because somebody has marked the buffer heads clean _after_ it started IO > on them. > > I think "__mpage_writepage()" is buggy in this regard, for example. It > even has a comment about its crapola behaviour: > > /* > * Must try to add the page before marking the buffer clean or > * the confused fail path above (OOM) will be very confused when > * it finds all bh marked clean (i.e. it will not write anything) > */ > > however, I don't think that particular thing explains it, because I don't > think we use that function for the cases I'm looking at. > > Anyway, I'll add tracing for page-writeback setting/cleaning too, in case > I might see anything new there.. > > Linus -- Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @) Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK Linux NTFS maintainer, http://www.linux-ntfs.org/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-28 23:36 ` Anton Altaparmakov @ 2006-12-28 23:54 ` Linus Torvalds 0 siblings, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-28 23:54 UTC (permalink / raw) To: Anton Altaparmakov Cc: Andrew Morton, Guillaume Chazarain, David Miller, ranma, gordonfarquharson, Marc Haber, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Martin Michlmayr, arjan, Chen Kenneth W On Thu, 28 Dec 2006, Anton Altaparmakov wrote: > > But are chunks 3 and 4 in separate buffer heads? Sorry could not see it > immediately from the output you showed... No, this is a 4kB filesystem. A single bh per page. > It is just that there may be a different cause rather than buffer dirty > state... Sure. > A shot in the dark I know but it could perhaps be that a "COW for > MAP_PRIVATE" like event happens when the page is dirty already thus the > second write never actually makes it to the shared page thus it never gets > written out. There are no private mappings anywhere, and no forks. Just a single mmap (well, we unmap and remap in order to force the page cache to be invalidated properly with the posix_fadvise() thing, but that's literally the only user). Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-28 19:00 ` Linus Torvalds 2006-12-28 19:05 ` Petri Kaukasoina 2006-12-28 21:24 ` Linus Torvalds @ 2006-12-29 17:49 ` Guillaume Chazarain 2 siblings, 0 replies; 311+ messages in thread From: Guillaume Chazarain @ 2006-12-29 17:49 UTC (permalink / raw) To: Linus Torvalds Cc: Marc Haber, Andrew Morton, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Martin Michlmayr Linus Torvalds a écrit : > going back to Linux-2.6.5 at least, according to one tester). > I apologize for the confusion, but it just occurred to me that I was actually experiencing a totally different problem: I set a root filesystem of 3Mib for qemu, so the test program just didn't have enough space for its file. -- Guillaume ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 4:51 ` Nick Piggin 2006-12-18 5:43 ` Andrew Morton @ 2006-12-18 5:50 ` Linus Torvalds 2006-12-18 7:16 ` Andrew Morton ` (2 more replies) 1 sibling, 3 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-18 5:50 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 18 Dec 2006, Nick Piggin wrote: > > I can't see how that's exactly a problem -- so long as the page does not > get reclaimed (it won't, because we have a ref on it) then all that matters > is that the page eventually gets marked dirty. But the point being that "try_to_free_buffers()" marks it clean AFTERWARDS. So yes, the page gets marked dirty in the pte's - the hardware generally does that for us, so we don't have to worry about that part going on. But "try_to_free_buffers()" seems to clear those dirty bits without serializing it really any way. It just says "ok, I will now clear them". Without knowing whether the dirty bits got set before the IO that cleared the buffer head dirty bits or not. What is _that_ serialization? As far as I can see, the only way to guarantee that to happen (since the dirty bits in the page tables will get set without us ever even being notified) is that the page tables themselves must simply never contain that page in a writable form at all. And that seems to be lacking. Anyway, I have what I consider a much simpler solution: just don't DO all that crap in try_to_free_buffers() at all. I sent it out to some people already, but not very widely. I reproduce my suggestion here for you (and maybe others too who weren't cc'd in that other discussion group) to comment on.. Linus --- So I think your patch is really broken, how about this one instead? It's really my previous patch, BUT it also adds a if (PageDirty(page) .. 
return 0; case, on the assumption that since PageDirty() means that one of the buffers should be dirty, there's no point in even _trying_ drop_buffers, since that should just fail anyway. Now, that assumption is obviously wrong _if_ the buffers have been cleaned by something else. So in that case, we now don't remove the buffer heads, but who really cares? The page will remain on the dirty list, and something should be trying to write it out, but since now all the buffers are clean, once that happens, there is no actual IO to happen. Hmm? So this means that we simply don't remove the buffers early from such pages, but there shouldn't be any real downside. Now, the only question would be if the page is marked dirty _while_ this is running. We do hold the page lock, but page dirtying doesn't get the lock, does it? But at least we won't mark the page _clean_ when it shouldn't be.. And we still are atomic wrt the actual buffer lists (mapping->private_lock), so I think this should all be ok, and drop_buffers() will do the right thing. So no race possible either. At least as far as I can see. And the patch certainly is simple. Now the question whether this actually _fixes_ any problems does remain, but I think this should be a pretty good solution if the bug really is here. Andrew? Linus ---- diff --git a/fs/buffer.c b/fs/buffer.c index d1f1b54..263f88e 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page) int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? */ @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page) spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* - * If the filesystem writes its buffers by hand (eg ext3) - * then we can have clean buffers against a dirty page. 
We - * clean the page here; otherwise later reattachment of buffers - * could encounter a non-uptodate page, which is unresolvable. - * This only applies in the rare case where try_to_free_buffers - * succeeds but the page is not freed. - * - * Also, during truncate, discard_buffer will have marked all - * the page's buffers clean. We discover that here and clean - * the page also. - */ - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; ^ permalink raw reply related [flat|nested] 311+ messages in thread
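The difference between the old and the proposed behaviour of try_to_free_buffers() can be illustrated with a deliberately simplified user-space model. It is a sketch under stated assumptions only: the struct and the `toy_` functions are hypothetical, each page is reduced to three booleans, and the model says nothing about whether this path is where the real corruption comes from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page state (assumption: one flag per concept, no locking). */
struct toy_page {
	bool page_dirty;	/* stands in for PageDirty() */
	bool buffers_dirty;	/* any attached buffer head dirty */
	bool has_buffers;	/* buffer heads still attached */
};

/* Old behaviour: once the buffers could be dropped, the page was
 * also marked clean, whatever had dirtied it in the meantime. */
static int toy_try_to_free_buffers_old(struct toy_page *pg)
{
	assert(pg != NULL);
	if (pg->buffers_dirty)
		return 0;		/* drop_buffers() analogue fails */
	pg->has_buffers = false;
	pg->page_dirty = false;		/* the step the patch removes */
	return 1;
}

/* Proposed behaviour: leave a dirty page alone entirely and never
 * clear its dirty state from this path. */
static int toy_try_to_free_buffers_new(struct toy_page *pg)
{
	assert(pg != NULL);
	if (pg->page_dirty)
		return 0;		/* the added PageDirty() check */
	if (pg->buffers_dirty)
		return 0;
	pg->has_buffers = false;
	return 1;
}
```

Starting from a page that is dirty but whose buffers were already cleaned by someone else, the old variant drops the buffers and silently clears the page's dirty state, while the new variant returns 0 and leaves the page dirty for the normal writeback path to deal with.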
* Re: 2.6.19 file content corruption on ext3 2006-12-18 5:50 ` Linus Torvalds @ 2006-12-18 7:16 ` Andrew Morton 2006-12-18 7:17 ` Andrew Morton 2006-12-18 9:30 ` Nick Piggin 2006-12-18 7:30 ` Nick Piggin 2006-12-18 9:19 ` Andrei Popa 2 siblings, 2 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-18 7:16 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Sun, 17 Dec 2006 21:50:43 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > > > On Mon, 18 Dec 2006, Nick Piggin wrote: > > > > I can't see how that's exactly a problem -- so long as the page does not > > get reclaimed (it won't, because we have a ref on it) then all that matters > > is that the page eventually gets marked dirty. > > But the point being that "try_to_free_buffers()" marks it clean > AFTERWARDS. > > So yes, the page gets marked dirty in the pte's - the hardware generally > does that for us, so we don't have to worry about that part going on. > > But "try_to_free_buffers()" seems to clear those dirty bits without > serializing it really any way. It just says "ok, I will now clear them". > Without knowing whether the dirty bits got set before the IO that cleared > the buffer head dirty bits or not. Yes, I can't see anything correct about the current behaviour. But I'm going blue in the face here trying to feed try_to_free_buffers() a page_mapped(page), without success. pagevec_strip() presumably isn't triggering. > What is _that_ serialization? As far as I can see, the only way to > guarantee that to happen (since the dirty bits in the page tables will get > set without us ever even being notified) is that the page tables > themselves must simply never contain that page in a writable form at all. > > And that seems to be lacking. > > Anyway, I have what I consider a much simpler solution: just don't DO all > that crap in try_to_free_buffers() at all. 
I sent it out to some people > already, not not very widely. > > I reproduce my suggestion here for you (and maybe others too who weren't > cc'd in that other discussion group) to comment on.. > > ... > > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page) > int ret = 0; > > BUG_ON(!PageLocked(page)); > - if (PageWriteback(page)) > + if (PageDirty(page) || PageWriteback(page)) > return 0; > > if (mapping == NULL) { /* can this still happen? */ > @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page) > spin_lock(&mapping->private_lock); > ret = drop_buffers(page, &buffers_to_free); > spin_unlock(&mapping->private_lock); > - if (ret) { > - /* > - * If the filesystem writes its buffers by hand (eg ext3) > - * then we can have clean buffers against a dirty page. We > - * clean the page here; otherwise later reattachment of buffers > - * could encounter a non-uptodate page, which is unresolvable. > - * This only applies in the rare case where try_to_free_buffers > - * succeeds but the page is not freed. > - * > - * Also, during truncate, discard_buffer will have marked all > - * the page's buffers clean. We discover that here and clean > - * the page also. > - */ > - if (test_clear_page_dirty(page)) > - task_io_account_cancelled_write(PAGE_CACHE_SIZE); > - } > out: > if (buffers_to_free) { > struct buffer_head *bh = buffers_to_free; This will (at least) cause truncate to do peculiar things. do_invalidatepage() runs discard_buffer() against the dirty page and will then expect try_to_free_buffers() to remove those buffers and then clean the page. truncate_complete_page() will clean the page, but it still has those invalidated buffers. We'll end up with a large number of clean, unused pages on the LRU, with attached buffers. These should eventually get reaped, but it'll change the page aging dynamics. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 7:16 ` Andrew Morton @ 2006-12-18 7:17 ` Andrew Morton 2006-12-18 9:30 ` Nick Piggin 1 sibling, 0 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-18 7:17 UTC (permalink / raw) To: Linus Torvalds, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Sun, 17 Dec 2006 23:16:17 -0800 Andrew Morton <akpm@osdl.org> wrote: > > out: > > if (buffers_to_free) { > > struct buffer_head *bh = buffers_to_free; > > This will (at least) cause truncate to do peculiar things. > do_invalidatepage() runs discard_buffer() against the dirty page and will > then expect try_to_free_buffers() to remove those buffers and then clean > the page. truncate_complete_page() will clean the page, but it still has > those invalidated buffers. We'll end up with a large number of clean, > unused pages on the LRU, with attached buffers. These should eventually > get reaped, but it'll change the page aging dynamics. That being said, it'd be great to get this tested by someone who can trigger this bug, please. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 7:16 ` Andrew Morton 2006-12-18 7:17 ` Andrew Morton @ 2006-12-18 9:30 ` Nick Piggin 1 sibling, 0 replies; 311+ messages in thread From: Nick Piggin @ 2006-12-18 9:30 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr Andrew Morton wrote: > On Sun, 17 Dec 2006 21:50:43 -0800 (PST) > Linus Torvalds <torvalds@osdl.org> wrote: > > >> >>On Mon, 18 Dec 2006, Nick Piggin wrote: >> >>>I can't see how that's exactly a problem -- so long as the page does not >>>get reclaimed (it won't, because we have a ref on it) then all that matters >>>is that the page eventually gets marked dirty. >> >>But the point being that "try_to_free_buffers()" marks it clean >>AFTERWARDS. >> >>So yes, the page gets marked dirty in the pte's - the hardware generally >>does that for us, so we don't have to worry about that part going on. >> >>But "try_to_free_buffers()" seems to clear those dirty bits without >>serializing it really any way. It just says "ok, I will now clear them". >>Without knowing whether the dirty bits got set before the IO that cleared >>the buffer head dirty bits or not. > > > Yes, I can't see anything correct about the current behaviour. > > But I'm going blue in the face here trying to feed try_to_free_buffers() a > page_mapped(page), without success. pagevec_strip() presumably isn't > triggering. I can trigger it here, with a kernel patch to call pagevec_strip unconditionally. I am seeing it clearing pte dirty bits, which is surely a dataloss bug. 
BUG: warning at mm/page-writeback.c:862/clear_page_dirty_warn()
 [<c013f65a>] clear_page_dirty_warn+0xdb/0xdd
 [<c0174309>] try_to_free_buffers+0x6b/0x7e
 [<c01937ec>] ext3_releasepage+0x0/0x74
 [<c013bb48>] try_to_release_page+0x2c/0x40
 [<c0140925>] pagevec_strip+0x52/0x54
 [<c0141580>] shrink_active_list+0x2a0/0x3c8
 [<c0142100>] shrink_zone+0xcd/0xea
 [<c014266d>] kswapd+0x311/0x41e
 [<c012c6aa>] autoremove_wake_function+0x0/0x37
 [<c014235c>] kswapd+0x0/0x41e
 [<c012c527>] kthread+0xde/0xe2
 [<c012c449>] kthread+0x0/0xe2
 [<c010395b>] kernel_thread_helper+0x7/0x1c
 =======================
(clear_page_dirty_warn() is test_clear_page_dirty which WARN_ON()s the result of page_mkclean) > This will (at least) cause truncate to do peculiar things. > do_invalidatepage() runs discard_buffer() against the dirty page and will > then expect try_to_free_buffers() to remove those buffers and then clean > the page. truncate_complete_page() will clean the page, but it still has > those invalidated buffers. We'll end up with a large number of clean, > unused pages on the LRU, with attached buffers. These should eventually > get reaped, but it'll change the page aging dynamics. This isn't so nice. I wonder if you could just ClearPageDirty before calling try_to_free_buffers in this case, or is that too much of a hack? Ideally I guess you want a variant that is happy to discard dirtiness (alternatively, my proposal to redirty the page if we find a dirty pte should also handle this). -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 5:50 ` Linus Torvalds 2006-12-18 7:16 ` Andrew Morton @ 2006-12-18 7:30 ` Nick Piggin 2006-12-18 9:19 ` Andrei Popa 2 siblings, 0 replies; 311+ messages in thread From: Nick Piggin @ 2006-12-18 7:30 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr Linus Torvalds wrote: > > On Mon, 18 Dec 2006, Nick Piggin wrote: > >>I can't see how that's exactly a problem -- so long as the page does not >>get reclaimed (it won't, because we have a ref on it) then all that matters >>is that the page eventually gets marked dirty. > > > But the point being that "try_to_free_buffers()" marks it clean > AFTERWARDS. For some reason I thought you were suggesting it is a problem on its own :P Yes I agree there is a pagefault vs ttfb race. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 5:50 ` Linus Torvalds 2006-12-18 7:16 ` Andrew Morton 2006-12-18 7:30 ` Nick Piggin @ 2006-12-18 9:19 ` Andrei Popa 2006-12-18 9:38 ` Andrew Morton 2 siblings, 1 reply; 311+ messages in thread From: Andrei Popa @ 2006-12-18 9:19 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Andrew Morton, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr I tried latest git with the patch from this email and it still get file content corruption. If I can help you further debug the problem tell me what to do. On Sun, 2006-12-17 at 21:50 -0800, Linus Torvalds wrote: > > On Mon, 18 Dec 2006, Nick Piggin wrote: > > > > I can't see how that's exactly a problem -- so long as the page does not > > get reclaimed (it won't, because we have a ref on it) then all that matters > > is that the page eventually gets marked dirty. > > But the point being that "try_to_free_buffers()" marks it clean > AFTERWARDS. > > So yes, the page gets marked dirty in the pte's - the hardware generally > does that for us, so we don't have to worry about that part going on. > > But "try_to_free_buffers()" seems to clear those dirty bits without > serializing it really any way. It just says "ok, I will now clear them". > Without knowing whether the dirty bits got set before the IO that cleared > the buffer head dirty bits or not. > > What is _that_ serialization? As far as I can see, the only way to > guarantee that to happen (since the dirty bits in the page tables will get > set without us ever even being notified) is that the page tables > themselves must simply never contain that page in a writable form at all. > > And that seems to be lacking. > > Anyway, I have what I consider a much simpler solution: just don't DO all > that crap in try_to_free_buffers() at all. I sent it out to some people > already, not not very widely. 
> > I reproduce my suggestion here for you (and maybe others too who weren't > cc'd in that other discussion group) to comment on.. > > Linus > > --- > > So I think your patch is really broken, how about this one instead? > > It's really my previous patch, BUT it also adds a > > if (PageDirty(page) .. > return 0; > > case, on the assumption that since PageDirty() measn that one of the > buffers should be dirty, there's no point in even _trying_ drop_buffers, > since that should just fail anyway. > > Now, that assumption is obviously wrong _if_ the buffers have been cleaned > by something else. So in that case, we now don't remove the buffer heads, > but who really cares? The page will remain on the dirty list, and > something should be trying to write it out, but since now all the buffers > are clean, once that happens, there is no actual IO to happen. > > Hmm? So this means that we simply don't remove the buffers early from such > pages, but there shouldn't be any real downside. > > Now, the only question would be if the page is marked dirty _while_ this > is running. We do hold the page lock, but page dirtying doesn't get the > lock, does it? But at least we won't mark the page _clean_ when it > shouldn't be.. And we still are atomic wrt the actual buffer lists > (mapping->private_lock), so I think this should all be ok, and > drop_buffers() will do the right thing. > > So no race possible either. > > At least as far as I can see. And the patch certainly is simple. > > Now the question whether this actually _fixes_ any problems does remain, > but I think this should be a pretty good solution if the bug really is > here. Andrew? 
> > Linus > > ---- > diff --git a/fs/buffer.c b/fs/buffer.c > index d1f1b54..263f88e 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page) > int ret = 0; > > BUG_ON(!PageLocked(page)); > - if (PageWriteback(page)) > + if (PageDirty(page) || PageWriteback(page)) > return 0; > > if (mapping == NULL) { /* can this still happen? */ > @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page) > spin_lock(&mapping->private_lock); > ret = drop_buffers(page, &buffers_to_free); > spin_unlock(&mapping->private_lock); > - if (ret) { > - /* > - * If the filesystem writes its buffers by hand (eg ext3) > - * then we can have clean buffers against a dirty page. We > - * clean the page here; otherwise later reattachment of buffers > - * could encounter a non-uptodate page, which is unresolvable. > - * This only applies in the rare case where try_to_free_buffers > - * succeeds but the page is not freed. > - * > - * Also, during truncate, discard_buffer will have marked all > - * the page's buffers clean. We discover that here and clean > - * the page also. > - */ > - if (test_clear_page_dirty(page)) > - task_io_account_cancelled_write(PAGE_CACHE_SIZE); > - } > out: > if (buffers_to_free) { > struct buffer_head *bh = buffers_to_free; > ^ permalink raw reply [flat|nested] 311+ messages in thread
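The effect of the quoted try_to_free_buffers() change can be illustrated with a small user-space toy model. This is only a sketch under loud assumptions: `struct toy_bh_page`, `toy_drop_buffers()`, `toy_ttfb_old()` and `toy_ttfb_new()` are hypothetical names invented for illustration, none of this is kernel code, and the real drop_buffers() checks more than buffer dirtiness. The scenario is a page that is still dirty while all of its buffers have already been written out by something else.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_BUFFERS 16

/* Hypothetical stand-in for a page with attached buffer heads. */
struct toy_bh_page {
    bool dirty;                          /* models PageDirty() */
    bool writeback;                      /* models PageWriteback() */
    size_t nr_buffers;                   /* number of attached buffers */
    bool buffer_dirty[TOY_MAX_BUFFERS];  /* per-buffer dirty state */
};

/* Toy drop_buffers(): refuse if any buffer is dirty, else detach them all. */
static int toy_drop_buffers(struct toy_bh_page *page)
{
    for (size_t i = 0; i < page->nr_buffers; i++)
        if (page->buffer_dirty[i])
            return 0;
    page->nr_buffers = 0;
    return 1;
}

/* Old behaviour: after dropping clean buffers, also mark the page clean. */
int toy_ttfb_old(struct toy_bh_page *page)
{
    if (page->writeback)
        return 0;
    int ret = toy_drop_buffers(page);
    if (ret)
        page->dirty = false;   /* dirtiness recorded after the IO is lost */
    return ret;
}

/* Behaviour of the quoted patch: never touch a dirty page at all. */
int toy_ttfb_new(struct toy_bh_page *page)
{
    if (page->dirty || page->writeback)
        return 0;
    return toy_drop_buffers(page);
}
```

In this model the old variant succeeds and silently clears the page's dirty flag, while the patched variant returns 0 and leaves both the dirty flag and the buffers in place, which matches the trade-off discussed in the thread (buffers are not removed early from such pages).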
* Re: 2.6.19 file content corruption on ext3 2006-12-18 9:19 ` Andrei Popa @ 2006-12-18 9:38 ` Andrew Morton 2006-12-18 10:00 ` Andrei Popa 0 siblings, 1 reply; 311+ messages in thread From: Andrew Morton @ 2006-12-18 9:38 UTC (permalink / raw) To: andrei.popa Cc: Linus Torvalds, Nick Piggin, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 18 Dec 2006 11:19:04 +0200 Andrei Popa <andrei.popa@i-neo.ro> wrote: > > I tried latest git with the patch from this email and it still get file > content corruption. If I can help you further debug the problem tell me > what to do. Can you please tell us all the steps which we need to take to reproduce this? ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 9:38 ` Andrew Morton @ 2006-12-18 10:00 ` Andrei Popa 2006-12-18 10:11 ` Peter Zijlstra 0 siblings, 1 reply; 311+ messages in thread From: Andrei Popa @ 2006-12-18 10:00 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Nick Piggin, Linux Kernel Mailing List, Peter Zijlstra, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 2006-12-18 at 01:38 -0800, Andrew Morton wrote: > On Mon, 18 Dec 2006 11:19:04 +0200 > Andrei Popa <andrei.popa@i-neo.ro> wrote: > > > > > I tried latest git with the patch from this email and it still get file > > content corruption. If I can help you further debug the problem tell me > > what to do. > > Can you please tell us all the steps which we need to take to reproduce this? I'm using rtorrent-0.7.0 and libtorrent-0.11.0. Just download a torrent with multiple files (I downloaded 84 rar files). When the download finishes, rtorrent does a hash check, and at the end of the check it says "Hash check on download completion found bad chunks, consider using "safe_sync"." and stops; most of the downloaded files are broken. With Peter Zijlstra's patch this error doesn't show, but there is still file corruption (although fewer files are corrupted); after the hash check, rtorrent downloads the bad chunks again, does another hash check, and then all files are ok. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 10:00 ` Andrei Popa @ 2006-12-18 10:11 ` Peter Zijlstra 2006-12-18 10:49 ` Andrei Popa 0 siblings, 1 reply; 311+ messages in thread From: Peter Zijlstra @ 2006-12-18 10:11 UTC (permalink / raw) To: andrei.popa Cc: Andrew Morton, Linus Torvalds, Nick Piggin, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 2006-12-18 at 12:00 +0200, Andrei Popa wrote: > On Mon, 2006-12-18 at 01:38 -0800, Andrew Morton wrote: > > On Mon, 18 Dec 2006 11:19:04 +0200 > > Andrei Popa <andrei.popa@i-neo.ro> wrote: > > > > > > > > I tried latest git with the patch from this email and it still get file > > > content corruption. If I can help you further debug the problem tell me > > > what to do. > > > > Can you please tell us all the steps which we need to take to reproduce this? > > I'm using rtorrent-0.7.0 and libtorrent-0.11.0, just download a torrent > with multiple files(I downloaded 84 rar files) and when it will finish > it will do a hash check and at the end of the check will say "Hash check > on download completion found bad chunks, consider using "safe_sync"." > and stop and most of the downloaded files are broken. With Peter > Zijlstra patch this error doesn't show but there is file > corruption(although less files are corrupted); afther the hash check, > rtorrent will download the bad chunks and do another hash check and all > files are ok. OK, I'll try this on a ext3 box. BTW, what data mode are you using ext3 in? Also, for testings sake, could you give this a go: It's a total hack but I guess worth testing. 
--- mm/rmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6-git/mm/rmap.c =================================================================== --- linux-2.6-git.orig/mm/rmap.c 2006-12-18 11:06:29.000000000 +0100 +++ linux-2.6-git/mm/rmap.c 2006-12-18 11:07:16.000000000 +0100 @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page goto unlock; entry = ptep_get_and_clear(mm, address, pte); - entry = pte_mkclean(entry); + /* entry = pte_mkclean(entry); */ entry = pte_wrprotect(entry); ptep_establish(vma, address, pte, entry); lazy_mmu_prot_update(entry); ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 10:11 ` Peter Zijlstra @ 2006-12-18 10:49 ` Andrei Popa 2006-12-18 15:24 ` Gene Heskett 0 siblings, 1 reply; 311+ messages in thread From: Andrei Popa @ 2006-12-18 10:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrew Morton, Linus Torvalds, Nick Piggin, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr > OK, I'll try this on a ext3 box. BTW, what data mode are you using ext3 > in? > ordered > > Also, for testings sake, could you give this a go: > It's a total hack but I guess worth testing. > > --- > mm/rmap.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > Index: linux-2.6-git/mm/rmap.c > =================================================================== > --- linux-2.6-git.orig/mm/rmap.c 2006-12-18 11:06:29.000000000 +0100 > +++ linux-2.6-git/mm/rmap.c 2006-12-18 11:07:16.000000000 +0100 > @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page > goto unlock; > > entry = ptep_get_and_clear(mm, address, pte); > - entry = pte_mkclean(entry); > + /* entry = pte_mkclean(entry); */ > entry = pte_wrprotect(entry); > ptep_establish(vma, address, pte, entry); > lazy_mmu_prot_update(entry); > with latest git and this patch there is no corruption ! ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 10:49 ` Andrei Popa @ 2006-12-18 15:24 ` Gene Heskett 2006-12-18 15:32 ` Peter Zijlstra 0 siblings, 1 reply; 311+ messages in thread From: Gene Heskett @ 2006-12-18 15:24 UTC (permalink / raw) To: linux-kernel, andrei.popa Cc: Peter Zijlstra, Andrew Morton, Linus Torvalds, Nick Piggin, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Monday 18 December 2006 05:49, Andrei Popa wrote: >> OK, I'll try this on a ext3 box. BTW, what data mode are you using >> ext3 in? > >ordered > >> Also, for testings sake, could you give this a go: >> It's a total hack but I guess worth testing. >> >> --- >> mm/rmap.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> Index: linux-2.6-git/mm/rmap.c >> =================================================================== >> --- linux-2.6-git.orig/mm/rmap.c 2006-12-18 11:06:29.000000000 +0100 >> +++ linux-2.6-git/mm/rmap.c 2006-12-18 11:07:16.000000000 +0100 >> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page >> goto unlock; >> >> entry = ptep_get_and_clear(mm, address, pte); >> - entry = pte_mkclean(entry); >> + /* entry = pte_mkclean(entry); */ >> entry = pte_wrprotect(entry); >> ptep_establish(vma, address, pte, entry); >> lazy_mmu_prot_update(entry); > >with latest git and this patch there is no corruption ! > I've not run a torrent app here recently. Should this patch be applied to a plain 2.6.20-rc1 before I run Azureus or similar apps? -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Yahoo.com and AOL/TW attorneys please note, additions to the above message by Gene Heskett are: Copyright 2006 by Maurice Eugene Heskett, all rights reserved. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 15:24 ` Gene Heskett @ 2006-12-18 15:32 ` Peter Zijlstra 2006-12-18 15:47 ` Gene Heskett 0 siblings, 1 reply; 311+ messages in thread From: Peter Zijlstra @ 2006-12-18 15:32 UTC (permalink / raw) To: Gene Heskett Cc: linux-kernel, andrei.popa, Andrew Morton, Linus Torvalds, Nick Piggin, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 2006-12-18 at 10:24 -0500, Gene Heskett wrote: > On Monday 18 December 2006 05:49, Andrei Popa wrote: > >> OK, I'll try this on a ext3 box. BTW, what data mode are you using > >> ext3 in? > > > >ordered > > > >> Also, for testings sake, could you give this a go: > >> It's a total hack but I guess worth testing. > >> > >> --- > >> mm/rmap.c | 2 +- > >> 1 file changed, 1 insertion(+), 1 deletion(-) > >> > >> Index: linux-2.6-git/mm/rmap.c > >> =================================================================== > >> --- linux-2.6-git.orig/mm/rmap.c 2006-12-18 11:06:29.000000000 +0100 > >> +++ linux-2.6-git/mm/rmap.c 2006-12-18 11:07:16.000000000 +0100 > >> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page > >> goto unlock; > >> > >> entry = ptep_get_and_clear(mm, address, pte); > >> - entry = pte_mkclean(entry); > >> + /* entry = pte_mkclean(entry); */ > >> entry = pte_wrprotect(entry); > >> ptep_establish(vma, address, pte, entry); > >> lazy_mmu_prot_update(entry); > > > >with latest git and this patch there is no corruption ! > > > I've not run a torrent app here recently. Should this patch be applied to > a plain 2.6-20-rc1 before I do run azureas or similar apps? depends on what the blue frog does, if it uses MAP_SHARED like rtorrent does then yeah, probably. This patch really should not be the final one, I'm currently still trying to wrap my head around the issue. That said, it should be safe to use :-) ^ permalink raw reply [flat|nested] 311+ messages in thread
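For readers who want to see the kind of MAP_SHARED write pattern being discussed, here is a minimal user-space sketch in C. It is an assumption-laden illustration, not rtorrent's actual code: `pattern_byte()` and `mmap_write_verify()` are hypothetical helpers, and the program only writes pages through a shared mapping, flushes them with msync(), and compares what read() returns. On its own it is not expected to reproduce the corruption, since the reports in this thread involve reclaim racing with faults on mapped pages under real load.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: deterministic byte for a given file offset. */
static unsigned char pattern_byte(size_t off)
{
    return (unsigned char)((off * 2654435761u) >> 24);
}

/*
 * Write `npages` pages of pattern data through a MAP_SHARED mapping,
 * flush with msync(MS_SYNC), then read the file back with read() and
 * count mismatching bytes. Returns the mismatch count, or -1 on error.
 * The file at `path` is created, truncated and removed again.
 */
long mmap_write_verify(const char *path, size_t npages)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    if (pagesz <= 0 || npages == 0)
        return -1;
    size_t len = (size_t)pagesz * npages;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        unlink(path);
        return -1;
    }

    unsigned char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        unlink(path);
        return -1;
    }

    /* Dirty every page through the mapping, as an mmap-based writer would. */
    for (size_t i = 0; i < len; i++)
        map[i] = pattern_byte(i);

    int sync_failed = (msync(map, len, MS_SYNC) != 0);
    munmap(map, len);

    long bad = sync_failed ? -1 : 0;
    if (!sync_failed && lseek(fd, 0, SEEK_SET) == 0) {
        unsigned char buf[4096];
        size_t off = 0;
        while (off < len) {
            ssize_t n = read(fd, buf, sizeof buf);
            if (n <= 0) {          /* short file or read error */
                bad = -1;
                break;
            }
            for (ssize_t i = 0; i < n; i++)
                if (buf[i] != pattern_byte(off + (size_t)i))
                    bad++;
            off += (size_t)n;
        }
    } else if (!sync_failed) {
        bad = -1;
    }

    close(fd);
    unlink(path);
    return bad;
}
```

A caller would pass a scratch path on the filesystem under test, for example `mmap_write_verify("/tmp/scratch.bin", 64)`, and treat any non-zero result as a failure.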
* Re: 2.6.19 file content corruption on ext3 2006-12-18 15:32 ` Peter Zijlstra @ 2006-12-18 15:47 ` Gene Heskett 0 siblings, 0 replies; 311+ messages in thread From: Gene Heskett @ 2006-12-18 15:47 UTC (permalink / raw) To: linux-kernel Cc: Peter Zijlstra, andrei.popa, Andrew Morton, Linus Torvalds, Nick Piggin, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Monday 18 December 2006 10:32, Peter Zijlstra wrote: [...] >> >> I've not run a torrent app here recently. Should this patch be >> applied to a plain 2.6-20-rc1 before I do run azureas or similar apps? > >depends on what the blue frog does, if it uses MAP_SHARED like rtorrent >does then yeah, probably. This patch really should not be the final one, >I'm currently still trying to wrap my head around the issue. That said, >it should be safe to use :-) > Thanks, I'll do it. -- Cheers, Gene ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-17 23:40 ` Andrew Morton 2006-12-18 1:02 ` Linus Torvalds 2006-12-18 1:22 ` Linus Torvalds @ 2006-12-18 16:55 ` Peter Zijlstra 2006-12-18 18:03 ` Linus Torvalds 2 siblings, 1 reply; 311+ messages in thread From: Peter Zijlstra @ 2006-12-18 16:55 UTC (permalink / raw) To: Andrew Morton Cc: andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Linus Torvalds, Florian Weimer, Marc Haber, Martin Michlmayr On Sun, 2006-12-17 at 15:40 -0800, Andrew Morton wrote: > On Sun, 17 Dec 2006 15:39:32 +0200 > Andrei Popa <andrei.popa@i-neo.ro> wrote: > > > I was mistaken, I'm still having file corruption with rtorrent. > > > > Well I'm not very optimistic, but if people could try this, please... > > > > From: Andrew Morton <akpm@osdl.org> > > try_to_free_buffers() clears the page's dirty state if it successfully removed > the page's buffers. > > Background for this: > > - a process does a one-byte-write to a file on a 64k pagesize, 4k > blocksize ext3 filesystem. The page is now PageDirty, !PgeUptodate and > has one dirty buffer and 15 not uptodate buffers. > > - kjournald writes the dirty buffer. The page is now PageDirty, > !PageUptodate and has a mix of clean and not uptodate buffers. > > - try_to_free_buffers() removes the page's buffers. It MUST now clear > PageDirty. If we were to leave the page dirty then we'd have a dirty, not > uptodate page with no buffer_heads. > > We're screwed: we cannot write the page because we don't know which > sections of it contain garbage. We cannot read the page because we don't > know which sections of it contain modified data. We cannot free the page > because it is dirty. > How about we stick something like this on top of that patch. It should preserve the dirty state as required. I tried to tinker with avoiding the clear/set thing but could not convince myself it was close to safe. 
This should be safe; page_mkclean walks the rmap and flips the pte's under the pte lock and records the dirty state while iterating. Concurrent faults will either do set_page_dirty() before we get around to doing it or vice versa, but dirty state is not lost. --- mm/page-writeback.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) Index: linux-2.6-git/mm/page-writeback.c =================================================================== --- linux-2.6-git.orig/mm/page-writeback.c 2006-12-18 17:24:41.000000000 +0100 +++ linux-2.6-git/mm/page-writeback.c 2006-12-18 17:26:56.000000000 +0100 @@ -872,8 +872,9 @@ int test_clear_page_dirty(struct page *p * page is locked, which pins the address_space */ if (mapping_cap_account_dirty(mapping)) { - if (must_clean_ptes) - page_mkclean(page); + int cleaned = page_mkclean(page); + if (!must_clean_ptes && cleaned) + set_page_dirty(page); dec_zone_page_state(page, NR_FILE_DIRTY); } return 1; ^ permalink raw reply [flat|nested] 311+ messages in thread
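The logic of the hunk above can be sketched as a user-space toy model. Everything here is an assumption made for illustration: `struct toy_page`, `toy_page_mkclean()`, `toy_set_page_dirty()` and `toy_test_clear_page_dirty()` are hypothetical names, there is no locking, and no claim is made that this mirrors the kernel beyond the control flow shown in the diff (clear the page flag, clean and write-protect the ptes, re-dirty the page if a pte was dirty and the caller did not ask for the ptes to be cleaned).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PTES 4

/* Hypothetical stand-in for a page cache page and the ptes mapping it. */
struct toy_page {
    bool page_dirty;               /* models the PG_dirty page flag */
    size_t nr_ptes;                /* number of mappings of this page */
    bool pte_dirty[TOY_MAX_PTES];  /* hardware dirty bit in each pte */
    bool pte_write[TOY_MAX_PTES];  /* write permission in each pte */
};

/* Toy page_mkclean(): clean and write-protect each pte, report dirtiness. */
static int toy_page_mkclean(struct toy_page *page)
{
    int cleaned = 0;
    for (size_t i = 0; i < page->nr_ptes; i++) {
        if (page->pte_dirty[i])
            cleaned = 1;
        page->pte_dirty[i] = false;
        page->pte_write[i] = false;   /* next write has to fault again */
    }
    return cleaned;
}

static void toy_set_page_dirty(struct toy_page *page)
{
    page->page_dirty = true;
}

/* Toy test_clear_page_dirty() following the control flow of the hunk. */
int toy_test_clear_page_dirty(struct toy_page *page, bool must_clean_ptes)
{
    if (!page->page_dirty)
        return 0;
    page->page_dirty = false;

    int cleaned = toy_page_mkclean(page);
    if (!must_clean_ptes && cleaned)
        toy_set_page_dirty(page);     /* carry pte dirtiness into the page */
    return 1;
}
```

In this model a page whose pte was dirtied through a mapping stays dirty after the call unless the caller passes `must_clean_ptes`, so the dirty state moves from the pte to the page instead of disappearing.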
* Re: 2.6.19 file content corruption on ext3 2006-12-18 16:55 ` Peter Zijlstra @ 2006-12-18 18:03 ` Linus Torvalds 2006-12-18 18:24 ` Peter Zijlstra 2006-12-19 4:36 ` Nick Piggin 0 siblings, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-18 18:03 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr Andrei, could you try Peter's patch (on top of Andrew's patch - it depends on it, and wouldn't work on an unmodified -git kernel, but add the WARN_ON() I mention in this email? You seem to be able to reproduce this easily.. Thanks) On Mon, 18 Dec 2006, Peter Zijlstra wrote: > > This should be safe; page_mkclean walks the rmap and flips the pte's > under the pte lock and records the dirty state while iterating. > Concurrent faults will either do set_page_dirty() before we get around > to doing it or vice versa, but dirty state is not lost. Ok, I really liked this patch, but the more I thought about it, the more I started to doubt the reasons for liking it. I think we have some core fundamental problem here that this patch is needed at all. So let's think about this: we apparently have two cases of "clear_page_dirty()": - the one that really wants to clear the bit unconditionally (Andrew calls this the "must_clean_ptes" case, which I personally find to be a really confusing name, but whatever) - the other case. The case that doesn't want to really clear the pte dirty bits. and I thought your patch made sense, because it saved away the pte state in the page dirty state, and that matches my mental model, but the more I think about it, the less sense that whole "the other case" situation makes AT ALL. Why does "the other case" exist at all? If you want to clear the dirty page flag, what is _ever_ the reason for not wanting to drop PTE dirty information? 
In other words, what possible reason can there ever be for saying "I want this page to be clean", while at the same time saying "but if it was dirty in the page tables, don't forget about that state".

So I absolutely detested Andrew's original patch, because that one made zero sense at all even from a code standpoint. With your patch on top, it all suddenly makes sense: at least you don't just leave dirty pages in the PTE's with a "struct page" that is marked clean, and the end result is undeniably at least _consistent_.

So Andrew's patch I can't stand, because the whole point of it seems to be to leave the system in an inconsistent state (dirty in the pte's but marked "clean"), and if we want to have that state, then we should just revert _everything_ to the 2.6.18 situation, and not play these games at all.

Andrew's patch with your patch on top makes me happy, because now we're at least honoring all the basic rules (we don't get into an inconsistent state), so on a local level it all makes sense. HOWEVER, I then don't actually understand how it could ever actually make sense to ask for "please clean the page, but don't actually clean it".

So _I_ think that we should add a honking huge WARN_ON() for this case. Ie, do your patch, but instead of re-dirtying the page:

+ if (!must_clean_ptes && cleaned)
+ set_page_dirty(page);

we would do

+ if (!must_clean_ptes && cleaned) {
+ WARN_ON(1);
+ set_page_dirty(page);
+ }

and ask the people who see this problem to see if they get the WARN_ON() (assuming it _fixes_ their data corruption).

Because whoever calls "clear_page_dirty()" without actually wanting to clean the PTE's is really a bug: those dirty PTE's had better not exist.

Or maybe the WARN_ON() just points out _why_ somebody would want to do something this insane. Right now I just can't see why it's a valid thing to do.

Maybe I'm still confused.

Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 18:03 ` Linus Torvalds @ 2006-12-18 18:24 ` Peter Zijlstra 2006-12-18 18:35 ` Linus Torvalds 2006-12-19 4:36 ` Nick Piggin 1 sibling, 1 reply; 311+ messages in thread From: Peter Zijlstra @ 2006-12-18 18:24 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 2006-12-18 at 10:03 -0800, Linus Torvalds wrote: > Andrei, > could you try Peter's patch (on top of Andrew's patch - it depends on > it, and wouldn't work on an unmodified -git kernel, but add the WARN_ON() > I mention in this email? You seem to be able to reproduce this easily.. > Thanks) I finally beat yum into submission and I hope to have rtorrent compiled shortly. > On Mon, 18 Dec 2006, Peter Zijlstra wrote: > > > > This should be safe; page_mkclean walks the rmap and flips the pte's > > under the pte lock and records the dirty state while iterating. > > Concurrent faults will either do set_page_dirty() before we get around > > to doing it or vice versa, but dirty state is not lost. > > Ok, I really liked this patch, but the more I thought about it, the more I > started to doubt the reasons for liking it. > > I think we have some core fundamental problem here that this patch is > needed at all. I agree, but I suspect this is like the buffered write deadlock Nick is working on, in that it will require some proper filesystem surgery to get right. Having the kernel working in the meantime has my preference ;-) > So let's think about this: we apparently have two cases of > "clear_page_dirty()": > > - the one that really wants to clear the bit unconditionally (Andrew > calls this the "must_clean_ptes" case, which I personally find to be a > really confusing name, but whatever) I'm probably worse with names so I'm not even going to try and fix that. > - the other case. The case that doesn't want to really clear the pte > dirty bits. 
>
> and I thought your patch made sense, because it saved away the pte state
> in the page dirty state, and that matches my mental model, but the more I
> think about it, the less sense that whole "the other case" situation makes
> AT ALL.
>
> Why does "the other case" exist at all? If you want to clear the dirty
> page flag, what is _ever_ the reason for not wanting to drop PTE dirty
> information? In other words, what possible reason can there ever be for
> saying "I want this page to be clean", while at the same time saying "but
> if it was dirty in the page tables, don't forget about that state".

I have tried to get my head around this, and have so far failed. Andrew's
mail with the patch (great-grandparent to this mail) was the one that made
most sense explaining it afaics.

> So I absolutely detested Andrew's original patch, because that one made
> zero sense at all even from a code standpoint. With your patch on top, it
> all suddenly makes sense: at least you don't just leave dirty pages in the
> PTE's with a "struct page" that is marked clean, and the end result is
> undeniably at least _consistent_.
>
> So Andrew's patch I can't stand, because the whole point of it seems to be
> to leave the system in an inconsistent state (dirty in the pte's but
> marked "clean"), and if we want to have that state, then we should just
> revert _everything_ to the 2.6.18 situation, and not play these games at
> all.
>
> Andrew's patch with your patch on top makes me happy, because now we're
> at least honoring all the basic rules (we don't get into an inconsistent
> state), so on a local level it all makes sense. HOWEVER, I then don't
> actually understand how it could ever actually make sense to ask for
> "please clean the page, but don't actually clean it".

Somehow it loses track of actual page content dirtiness when it does the
page buffer game. Is this because page buffers are used to do sub-page
sized writes without RMW cycles?
Cannot this case be avoided when the page is mapped, because at that
point the whole page will be resident anyway?

> So _I_ think that we should add a honking huge WARN_ON() for this case.
> Ie, do your patch, but instead of re-dirtying the page:
>
> +	if (!must_clean_ptes && cleaned)
> +		set_page_dirty(page);
>
> we would do
>
> +	if (!must_clean_ptes && cleaned) {
> +		WARN_ON(1);
> +		set_page_dirty(page);
> +	}
>
> and ask the people who see this problem to see if they get the WARN_ON()
> (assuming it _fixes_ their data corruption).
>
> Because whoever calls "clean_dirty_page()" without actually wanting to
> clean the PTE's is really a bug: those dirty PTE's had better not exist.
>
> Or maybe the WARN_ON() just points out _why_ somebody would want to do
> something this insane. Right now I just can't see why it's a valid thing
> to do.

Maybe, but I think Nick's mail here:
  http://lkml.org/lkml/2006/12/18/59

shows a trace like that.

I'm guessing that if we do the WARN_ON() some folks might get a lot of
output, perhaps WARN_ON_ONCE() ?

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 18:24 ` Peter Zijlstra @ 2006-12-18 18:35 ` Linus Torvalds 2006-12-18 19:04 ` Andrei Popa 0 siblings, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-18 18:35 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 18 Dec 2006, Peter Zijlstra wrote:
> >
> > Or maybe the WARN_ON() just points out _why_ somebody would want to do
> > something this insane. Right now I just can't see why it's a valid thing
> > to do.
>
> Maybe, but I think Nick's mail here:
>   http://lkml.org/lkml/2006/12/18/59
>
> shows a trace like that.

Sure, but I actually think that "try_to_free_buffers()" was buggy in the
first place, shouldn't have done what it did at all (it has NO business
clearing dirty data), and should be fixed with my other simple and clean
patch that just removes the crap.

But sadly, Andrei said that he still saw data corruption, which implies
that the problem had nothing to do with "try_to_free_buffers()" at all.

(On that note: Andrei - if you do test this out, I'd suggest applying my
patch too - the one that you already tested. It won't apply cleanly on top
of Andrew's patch, but it should be trivial to apply by hand, since you
really just want to remove the whole "if (ret) {...}" sequence. I realize
that it didn't make any difference for you, but applying that patch is
probably a good idea just to remove the noise for a codepath that you
already showed to not matter)

> I'm guessing that if we do the WARN_ON() some folks might get a lot of
> output, perhaps WARN_ON_ONCE() ?

Well, I'd rather get lots of noise to see all the paths that can cause
this. We've been concentrating mainly on one (try_to_free_buffers()), but
that one was already shown not to matter or at least not to be the _whole_
issue, so..

		Linus

^ permalink raw reply [flat|nested] 311+ messages in thread
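The trade-off between WARN_ON() and WARN_ON_ONCE() discussed above can be
illustrated with a small user-space model. This is only a sketch, not kernel
code: struct model_warn_site and the model_* functions are hypothetical names
made up for the illustration, and a recorded caller name stands in for the
backtrace that a real warning would print. The single warning site is assumed
to sit inside test_clear_page_dirty(), as proposed in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_REPORTS 8

/* Hypothetical record of every report emitted by one warning site. */
struct model_warn_site {
	const char *callers[MODEL_MAX_REPORTS];	/* who reached the site */
	size_t count;				/* reports emitted so far */
	bool once_fired;			/* state used by the "once" variant */
};

/* Stand-in for printing a backtrace: remember which caller got here. */
static void model_record(struct model_warn_site *site, const char *caller)
{
	assert(site != NULL);
	if (site->count < MODEL_MAX_REPORTS)
		site->callers[site->count++] = caller;
}

/* Models WARN_ON(1): every hit produces a report. */
static void model_warn_on(struct model_warn_site *site, const char *caller)
{
	model_record(site, caller);
}

/* Models WARN_ON_ONCE(1): only the first hit at this site is reported. */
static void model_warn_on_once(struct model_warn_site *site, const char *caller)
{
	if (site->once_fired)
		return;
	site->once_fired = true;
	model_record(site, caller);
}
```

If three different callers reach the same site, the WARN_ON() model records
all three, while the WARN_ON_ONCE() model records only the first. Under these
assumptions that is the reason for preferring the noisy variant while the
goal is still to discover every path that hits the check.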
* Re: 2.6.19 file content corruption on ext3 2006-12-18 18:35 ` Linus Torvalds @ 2006-12-18 19:04 ` Andrei Popa 2006-12-18 19:10 ` Peter Zijlstra 2006-12-18 19:18 ` Linus Torvalds 0 siblings, 2 replies; 311+ messages in thread From: Andrei Popa @ 2006-12-18 19:04 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr > (On that note: Andrei - if you do test this out, I'd suggest applying my > patch too - the one that you already tested. It won't apply cleanly on top > of Andrew's patch, but it should be trivial to apply by hand, since you > really just want to remove the whole "if (ret) {...}" sequence. I realize > that it didn't make any difference for you, but applying that patch is > probably a good idea just to remove the noise for a codepath that you > already showed to not matter) I applied Linus patch, Andrew patch, Peter Zijlstra patches(the last two). All unified patch is attached. I tested and I have no corruption. diff --git a/fs/buffer.c b/fs/buffer.c index d1f1b54..263f88e 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? */ @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* - * If the filesystem writes its buffers by hand (eg ext3) - * then we can have clean buffers against a dirty page. We - * clean the page here; otherwise later reattachment of buffers - * could encounter a non-uptodate page, which is unresolvable. - * This only applies in the rare case where try_to_free_buffers - * succeeds but the page is not freed. 
- * - * Also, during truncate, discard_buffer will have marked all - * the page's buffers clean. We discover that here and clean - * the page also. - */ - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff --git a/fs/cifs/file.c b/fs/cifs/file.c index 0f05cab..760442f 100644 --- a/fs/cifs/file.c +++ b/fs/cifs/file.c @@ -1245,7 +1245,7 @@ retry: wait_on_page_writeback(page); if (PageWriteback(page) || - !test_clear_page_dirty(page)) { + !test_clear_page_dirty(page, 1)) { unlock_page(page); break; } diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 1387749..da2bdb1 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -484,7 +484,7 @@ static int fuse_commit_write(struct file spin_unlock(&fc->lock); if (offset == 0 && to == PAGE_CACHE_SIZE) { - clear_page_dirty(page); + clear_page_dirty(page, 0); SetPageUptodate(page); } } diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index ed2c223..7b87875 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + clear_page_dirty(page, 1); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c index b1a1c72..47a6b62 100644 --- a/fs/jfs/jfs_metapage.c +++ b/fs/jfs/jfs_metapage.c @@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1 /* Retest mp->count since we may have released page lock */ if (test_bit(META_discard, &mp->flag) && !mp->count) { - clear_page_dirty(page); + clear_page_dirty(page, 1); ClearPageUptodate(page); } #else diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c index 47e7027..a97e198 100644 --- a/fs/reiserfs/stree.c +++ b/fs/reiserfs/stree.c @@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p bh = next; } while (bh != head); if (PAGE_SIZE == bh->b_size) { - 
clear_page_dirty(page); + clear_page_dirty(page, 0); } } } diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c index b56eb75..d65ba84 100644 --- a/fs/xfs/linux-2.6/xfs_aops.c +++ b/fs/xfs/linux-2.6/xfs_aops.c @@ -343,7 +343,7 @@ xfs_start_page_writeback( ASSERT(!PageWriteback(page)); set_page_writeback(page); if (clear_dirty) - clear_page_dirty(page); + clear_page_dirty(page, 1); unlock_page(page); if (!buffers) { end_page_writeback(page); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 4830a3b..175ab3c 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -253,13 +253,13 @@ #define ClearPageUncached(page) clear_bi struct page; /* forward declaration */ -int test_clear_page_dirty(struct page *page); +int test_clear_page_dirty(struct page *page, int must_clean_ptes); int test_clear_page_writeback(struct page *page); int test_set_page_writeback(struct page *page); -static inline void clear_page_dirty(struct page *page) +static inline void clear_page_dirty(struct page *page, int must_clean_ptes) { - test_clear_page_dirty(page); + test_clear_page_dirty(page, must_clean_ptes); } static inline void set_page_writeback(struct page *page) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 237107c..561d702 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock); * Clear a page's dirty flag, while caring for dirty memory accounting. * Returns true if the page was previously dirty. 
*/ -int test_clear_page_dirty(struct page *page) +int test_clear_page_dirty(struct page *page, int must_clean_ptes) { struct address_space *mapping = page_mapping(page); unsigned long flags; @@ -866,7 +866,9 @@ int test_clear_page_dirty(struct page *p * page is locked, which pins the address_space */ if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); + int cleaned = page_mkclean(page); + if (!must_clean_ptes && cleaned) + set_page_dirty(page); dec_zone_page_state(page, NR_FILE_DIRTY); } return 1; diff --git a/mm/rmap.c b/mm/rmap.c index d8a842a..3f9061e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page goto unlock; entry = ptep_get_and_clear(mm, address, pte); - entry = pte_mkclean(entry); + /*entry = pte_mkclean(entry);*/ entry = pte_wrprotect(entry); ptep_establish(vma, address, pte, entry); lazy_mmu_prot_update(entry); diff --git a/mm/truncate.c b/mm/truncate.c index 9bfb8e8..cafa843 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp if (PagePrivate(page)) do_invalidatepage(page, 0); - if (test_clear_page_dirty(page)) + if (test_clear_page_dirty(page, 1)) task_io_account_cancelled_write(PAGE_CACHE_SIZE); ClearPageUptodate(page); ClearPageMappedToDisk(page); @@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct PAGE_CACHE_SIZE, 0); } } - was_dirty = test_clear_page_dirty(page); + was_dirty = test_clear_page_dirty(page, 0); if (!invalidate_complete_page2(mapping, page)) { if (was_dirty) set_page_dirty(page); > > > I'm guessing that if we do the WARN_ON() some folks might get a lot of > > output, perhaps WARN_ON_ONCE() ? > > Well, I'd rather get lots of noise to see all the paths that can cause > this. We've been concentrating mainly on one (try_to_free_buffers()), but > that one was already shown not to matter or at least not to be the _whole_ > issue, so.. > > Linus ^ permalink raw reply related [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 19:04 ` Andrei Popa @ 2006-12-18 19:10 ` Peter Zijlstra 2006-12-18 19:18 ` Linus Torvalds 1 sibling, 0 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-18 19:10 UTC (permalink / raw) To: andrei.popa Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 21:04 +0200, Andrei Popa wrote:
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..3f9061e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page
>  		goto unlock;
>
>  	entry = ptep_get_and_clear(mm, address, pte);
> -	entry = pte_mkclean(entry);
> +	/*entry = pte_mkclean(entry);*/
>  	entry = pte_wrprotect(entry);
>  	ptep_establish(vma, address, pte, entry);
>  	lazy_mmu_prot_update(entry);

please drop this chunk, this will always make the problem go away.

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 19:04 ` Andrei Popa 2006-12-18 19:10 ` Peter Zijlstra @ 2006-12-18 19:18 ` Linus Torvalds 2006-12-18 19:44 ` Andrei Popa 2006-12-19 7:38 ` Peter Zijlstra 1 sibling, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-18 19:18 UTC (permalink / raw) To: Andrei Popa Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 18 Dec 2006, Andrei Popa wrote:
>
> I applied Linus patch, Andrew patch, Peter Zijlstra patches(the last
> two). All unified patch is attached. I tested and I have no corruption.

That wasn't very interesting, because you also had the patch that just
disabled "page_mkclean_one()" entirely:

> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..3f9061e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page
>  		goto unlock;
>
>  	entry = ptep_get_and_clear(mm, address, pte);
> -	entry = pte_mkclean(entry);
> +	/*entry = pte_mkclean(entry);*/
>  	entry = pte_wrprotect(entry);
>  	ptep_establish(vma, address, pte, entry);
>  	lazy_mmu_prot_update(entry);

The above patch is bad. It's always going to hide the bug, but it hides it
by just not doing anything at all. So any patch combination that contains
that patch will probably _always_ fix your problem, but it won't be an
interesting patch..

So can you remove that small fragment? Also, it would be nice if you added
the WARN_ON() to this sequence in mm/page-writeback.c:

	+	if (!must_clean_ptes && cleaned)
	+		set_page_dirty(page);

just make it do a WARN_ON() if this ever triggers.

Then, IF the corruption is gone, we'd love to see the WARN_ON results..

		Linus

^ permalink raw reply [flat|nested] 311+ messages in thread
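The bookkeeping argued over in this message can be modelled in user space.
What follows is a minimal sketch, not the kernel implementation: struct
model_page, struct model_pte and every model_* function are hypothetical
names invented for the illustration, locking and dirty accounting are left
out, and the model assumes that page_mkclean() reports how many ptes were
dirty, as described earlier in the thread. The keep_pte_dirty flag stands in
for the mm/rmap.c hunk that comments out pte_mkclean().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_PTES 4

/* Hypothetical stand-in for one page-table entry that maps the page. */
struct model_pte {
	bool present;
	bool dirty;
	bool writable;
};

/* Hypothetical stand-in for struct page together with its mappings. */
struct model_page {
	bool page_dirty;			/* models the page dirty flag */
	struct model_pte ptes[MODEL_MAX_PTES];	/* models the rmap walk */
	int warnings;				/* times WARN_ON(1) would fire */
};

/*
 * Models page_mkclean(): write-protect each dirty mapping and count it.
 * With keep_pte_dirty set, the pte dirty bit is left alone, which models
 * the hunk that comments out pte_mkclean().
 */
static int model_page_mkclean(struct model_page *page, bool keep_pte_dirty)
{
	int cleaned = 0;

	assert(page != NULL);
	for (size_t i = 0; i < MODEL_MAX_PTES; i++) {
		struct model_pte *pte = &page->ptes[i];

		if (!pte->present || !pte->dirty)
			continue;
		if (!keep_pte_dirty)
			pte->dirty = false;
		pte->writable = false;
		cleaned++;
	}
	return cleaned;
}

/* Models test_clear_page_dirty(page, must_clean_ptes) with the WARN_ON(). */
static int model_test_clear_page_dirty(struct model_page *page,
				       int must_clean_ptes, bool keep_pte_dirty)
{
	int cleaned;

	if (!page->page_dirty)
		return 0;
	page->page_dirty = false;

	cleaned = model_page_mkclean(page, keep_pte_dirty);
	if (!must_clean_ptes && cleaned) {
		page->warnings++;		/* where WARN_ON(1) would sit */
		page->page_dirty = true;	/* models set_page_dirty(page) */
	}
	return 1;
}

/* True if dirty state is still recorded in the page flag or in any pte. */
static bool model_dirty_state_survives(const struct model_page *page)
{
	if (page->page_dirty)
		return true;
	for (size_t i = 0; i < MODEL_MAX_PTES; i++)
		if (page->ptes[i].present && page->ptes[i].dirty)
			return true;
	return false;
}
```

In this model, a call with must_clean_ptes == 0 on a page that has a dirty
pte re-dirties the page and bumps the warning counter, which is the case the
WARN_ON() is meant to expose. With keep_pte_dirty set, the pte dirty bit is
never cleared, so dirty state survives no matter what the caller asked for;
under these assumptions that is why a patch set containing that hunk says
nothing about whether the other patches fix the corruption.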
* Re: 2.6.19 file content corruption on ext3 2006-12-18 19:18 ` Linus Torvalds @ 2006-12-18 19:44 ` Andrei Popa 2006-12-18 20:14 ` Linus Torvalds 2006-12-19 7:38 ` Peter Zijlstra 1 sibling, 1 reply; 311+ messages in thread From: Andrei Popa @ 2006-12-18 19:44 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 11:18 -0800, Linus Torvalds wrote:
>
> On Mon, 18 Dec 2006, Andrei Popa wrote:
> >
> > I applied Linus patch, Andrew patch, Peter Zijlstra patches(the last
> > two). All unified patch is attached. I tested and I have no corruption.
>
> That wasn't very interesting, because you also had the patch that just
> disabled "page_mkclean_one()" entirely:
>
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index d8a842a..3f9061e 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page
> >  		goto unlock;
> >
> >  	entry = ptep_get_and_clear(mm, address, pte);
> > -	entry = pte_mkclean(entry);
> > +	/*entry = pte_mkclean(entry);*/
> >  	entry = pte_wrprotect(entry);
> >  	ptep_establish(vma, address, pte, entry);
> >  	lazy_mmu_prot_update(entry);
>
> The above patch is bad. It's always going to hide the bug, but it hides it
> by just not doing anything at all. So any patch combination that contains
> that patch will probably _always_ fix your problem, but it won't be an
> interesting patch..
>
> So can you remove that small fragment? Also, it would be nice if you added
> the WARN_ON() to this sequence in mm/page-writeback.c:
>
> +	if (!must_clean_ptes && cleaned)
> +		set_page_dirty(page);
>
> just make it do a WARN_ON() if this ever triggers.
>
> Then, IF the corruption is gone, we'd love to see the WARN_ON results..
>
> 		Linus

I dropped that patch and added WARN_ON(1), the unified patch is attached.
I got corruption: "Hash check on download completion found bad chunks,
consider using "safe_sync"."
In dmesg there is no message from WARN_ON(1), my .config is attached. diff --git a/fs/buffer.c b/fs/buffer.c index d1f1b54..263f88e 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? */ @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* - * If the filesystem writes its buffers by hand (eg ext3) - * then we can have clean buffers against a dirty page. We - * clean the page here; otherwise later reattachment of buffers - * could encounter a non-uptodate page, which is unresolvable. - * This only applies in the rare case where try_to_free_buffers - * succeeds but the page is not freed. - * - * Also, during truncate, discard_buffer will have marked all - * the page's buffers clean. We discover that here and clean - * the page also. 
- */ - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff --git a/fs/cifs/file.c b/fs/cifs/file.c index 0f05cab..760442f 100644 --- a/fs/cifs/file.c +++ b/fs/cifs/file.c @@ -1245,7 +1245,7 @@ retry: wait_on_page_writeback(page); if (PageWriteback(page) || - !test_clear_page_dirty(page)) { + !test_clear_page_dirty(page, 1)) { unlock_page(page); break; } diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 1387749..da2bdb1 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -484,7 +484,7 @@ static int fuse_commit_write(struct file spin_unlock(&fc->lock); if (offset == 0 && to == PAGE_CACHE_SIZE) { - clear_page_dirty(page); + clear_page_dirty(page, 0); SetPageUptodate(page); } } diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index ed2c223..7b87875 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + clear_page_dirty(page, 1); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c index b1a1c72..47a6b62 100644 --- a/fs/jfs/jfs_metapage.c +++ b/fs/jfs/jfs_metapage.c @@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1 /* Retest mp->count since we may have released page lock */ if (test_bit(META_discard, &mp->flag) && !mp->count) { - clear_page_dirty(page); + clear_page_dirty(page, 1); ClearPageUptodate(page); } #else diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c index 47e7027..a97e198 100644 --- a/fs/reiserfs/stree.c +++ b/fs/reiserfs/stree.c @@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p bh = next; } while (bh != head); if (PAGE_SIZE == bh->b_size) { - clear_page_dirty(page); + clear_page_dirty(page, 0); } } } diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c index b56eb75..d65ba84 100644 --- 
a/fs/xfs/linux-2.6/xfs_aops.c +++ b/fs/xfs/linux-2.6/xfs_aops.c @@ -343,7 +343,7 @@ xfs_start_page_writeback( ASSERT(!PageWriteback(page)); set_page_writeback(page); if (clear_dirty) - clear_page_dirty(page); + clear_page_dirty(page, 1); unlock_page(page); if (!buffers) { end_page_writeback(page); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 4830a3b..175ab3c 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -253,13 +253,13 @@ #define ClearPageUncached(page) clear_bi struct page; /* forward declaration */ -int test_clear_page_dirty(struct page *page); +int test_clear_page_dirty(struct page *page, int must_clean_ptes); int test_clear_page_writeback(struct page *page); int test_set_page_writeback(struct page *page); -static inline void clear_page_dirty(struct page *page) +static inline void clear_page_dirty(struct page *page, int must_clean_ptes) { - test_clear_page_dirty(page); + test_clear_page_dirty(page, must_clean_ptes); } static inline void set_page_writeback(struct page *page) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 237107c..f7e0cc8 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock); * Clear a page's dirty flag, while caring for dirty memory accounting. * Returns true if the page was previously dirty. 
*/ -int test_clear_page_dirty(struct page *page) +int test_clear_page_dirty(struct page *page, int must_clean_ptes) { struct address_space *mapping = page_mapping(page); unsigned long flags; @@ -866,7 +866,12 @@ int test_clear_page_dirty(struct page *p * page is locked, which pins the address_space */ if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); + int cleaned = page_mkclean(page); + if (!must_clean_ptes && cleaned){ + WARN_ON(1); + set_page_dirty(page); + } + dec_zone_page_state(page, NR_FILE_DIRTY); } return 1; diff --git a/mm/rmap.c b/mm/rmap.c diff --git a/mm/truncate.c b/mm/truncate.c index 9bfb8e8..cafa843 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp if (PagePrivate(page)) do_invalidatepage(page, 0); - if (test_clear_page_dirty(page)) + if (test_clear_page_dirty(page, 1)) task_io_account_cancelled_write(PAGE_CACHE_SIZE); ClearPageUptodate(page); ClearPageMappedToDisk(page); @@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct PAGE_CACHE_SIZE, 0); } } - was_dirty = test_clear_page_dirty(page); + was_dirty = test_clear_page_dirty(page, 0); if (!invalidate_complete_page2(mapping, page)) { if (was_dirty) set_page_dirty(page); # # Automatically generated make config: don't edit # Linux kernel version: 2.6.20-rc1 # Sun Dec 17 01:52:12 2006 # CONFIG_X86_32=y CONFIG_GENERIC_TIME=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_X86=y CONFIG_MMU=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # Code maturity level options # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 # # General setup # CONFIG_LOCALVERSION="" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_SWAP=y CONFIG_SYSVIPC=y # CONFIG_IPC_NS is not set # CONFIG_POSIX_MQUEUE is not set # CONFIG_BSD_PROCESS_ACCT is 
not set # CONFIG_TASKSTATS is not set # CONFIG_UTS_NS is not set # CONFIG_AUDIT is not set CONFIG_IKCONFIG=y # CONFIG_IKCONFIG_PROC is not set # CONFIG_CPUSETS is not set # CONFIG_SYSFS_DEPRECATED is not set # CONFIG_RELAY is not set CONFIG_INITRAMFS_SOURCE="" # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set CONFIG_SYSCTL=y # CONFIG_EMBEDDED is not set CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y # CONFIG_KALLSYMS_ALL is not set # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_EPOLL=y CONFIG_SHMEM=y CONFIG_SLAB=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 # CONFIG_SLOB is not set # # Loadable module support # # CONFIG_MODULES is not set CONFIG_STOP_MACHINE=y # # Block layer # CONFIG_BLOCK=y # CONFIG_LBD is not set # CONFIG_BLK_DEV_IO_TRACE is not set # CONFIG_LSF is not set # # IO Schedulers # CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_AS=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y # CONFIG_DEFAULT_AS is not set # CONFIG_DEFAULT_DEADLINE is not set CONFIG_DEFAULT_CFQ=y # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED="cfq" # # Processor type and features # CONFIG_SMP=y CONFIG_X86_PC=y # CONFIG_X86_ELAN is not set # CONFIG_X86_VOYAGER is not set # CONFIG_X86_NUMAQ is not set # CONFIG_X86_SUMMIT is not set # CONFIG_X86_BIGSMP is not set # CONFIG_X86_VISWS is not set # CONFIG_X86_GENERICARCH is not set # CONFIG_X86_ES7000 is not set # CONFIG_PARAVIRT is not set # CONFIG_M386 is not set # CONFIG_M486 is not set # CONFIG_M586 is not set # CONFIG_M586TSC is not set # CONFIG_M586MMX is not set # CONFIG_M686 is not set # CONFIG_MPENTIUMII is not set # CONFIG_MPENTIUMIII is not set CONFIG_MPENTIUMM=y # CONFIG_MCORE2 is not set # CONFIG_MPENTIUM4 is not set # CONFIG_MK6 is not set # CONFIG_MK7 is not set # CONFIG_MK8 is not set # CONFIG_MCRUSOE is not set # CONFIG_MEFFICEON is not set # CONFIG_MWINCHIPC6 is not set # 
CONFIG_MWINCHIP2 is not set # CONFIG_MWINCHIP3D is not set # CONFIG_MGEODEGX1 is not set # CONFIG_MGEODE_LX is not set # CONFIG_MCYRIXIII is not set # CONFIG_MVIAC3_2 is not set # CONFIG_X86_GENERIC is not set CONFIG_X86_CMPXCHG=y CONFIG_X86_XADD=y CONFIG_X86_L1_CACHE_SHIFT=6 CONFIG_RWSEM_XCHGADD_ALGORITHM=y # CONFIG_ARCH_HAS_ILOG2_U32 is not set # CONFIG_ARCH_HAS_ILOG2_U64 is not set CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_X86_WP_WORKS_OK=y CONFIG_X86_INVLPG=y CONFIG_X86_BSWAP=y CONFIG_X86_POPAD_OK=y CONFIG_X86_CMPXCHG64=y CONFIG_X86_GOOD_APIC=y CONFIG_X86_INTEL_USERCOPY=y CONFIG_X86_USE_PPRO_CHECKSUM=y CONFIG_X86_TSC=y CONFIG_HPET_TIMER=y CONFIG_HPET_EMULATE_RTC=y CONFIG_NR_CPUS=8 # CONFIG_SCHED_SMT is not set CONFIG_SCHED_MC=y # CONFIG_PREEMPT_NONE is not set # CONFIG_PREEMPT_VOLUNTARY is not set CONFIG_PREEMPT=y CONFIG_PREEMPT_BKL=y CONFIG_X86_LOCAL_APIC=y CONFIG_X86_IO_APIC=y CONFIG_X86_MCE=y CONFIG_X86_MCE_NONFATAL=y CONFIG_X86_MCE_P4THERMAL=y CONFIG_VM86=y # CONFIG_TOSHIBA is not set # CONFIG_I8K is not set # CONFIG_X86_REBOOTFIXUPS is not set # CONFIG_MICROCODE is not set # CONFIG_X86_MSR is not set # CONFIG_X86_CPUID is not set # # Firmware Drivers # # CONFIG_EDD is not set # CONFIG_DELL_RBU is not set # CONFIG_DCDBAS is not set # CONFIG_NOHIGHMEM is not set CONFIG_HIGHMEM4G=y # CONFIG_HIGHMEM64G is not set CONFIG_PAGE_OFFSET=0xC0000000 CONFIG_HIGHMEM=y CONFIG_ARCH_FLATMEM_ENABLE=y CONFIG_ARCH_SPARSEMEM_ENABLE=y CONFIG_ARCH_SELECT_MEMORY_MODEL=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_SELECT_MEMORY_MODEL=y CONFIG_FLATMEM_MANUAL=y # CONFIG_DISCONTIGMEM_MANUAL is not set # CONFIG_SPARSEMEM_MANUAL is not set CONFIG_FLATMEM=y CONFIG_FLAT_NODE_MEM_MAP=y CONFIG_SPARSEMEM_STATIC=y CONFIG_SPLIT_PTLOCK_CPUS=4 # CONFIG_RESOURCES_64BIT is not set # CONFIG_HIGHPTE is not set # CONFIG_MATH_EMULATION is not set CONFIG_MTRR=y # CONFIG_EFI is not set CONFIG_IRQBALANCE=y # CONFIG_SECCOMP is not set # CONFIG_HZ_100 is not set # CONFIG_HZ_250 is not set # CONFIG_HZ_300 is 
not set CONFIG_HZ_1000=y CONFIG_HZ=1000 # CONFIG_KEXEC is not set # CONFIG_CRASH_DUMP is not set # CONFIG_RELOCATABLE is not set CONFIG_PHYSICAL_ALIGN=0x100000 CONFIG_HOTPLUG_CPU=y # CONFIG_COMPAT_VDSO is not set CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y # # Power management options (ACPI, APM) # CONFIG_PM=y # CONFIG_PM_LEGACY is not set # CONFIG_PM_DEBUG is not set # CONFIG_PM_SYSFS_DEPRECATED is not set CONFIG_SOFTWARE_SUSPEND=y CONFIG_PM_STD_PARTITION="" CONFIG_SUSPEND_SMP=y # # ACPI (Advanced Configuration and Power Interface) Support # CONFIG_ACPI=y CONFIG_ACPI_SLEEP=y CONFIG_ACPI_SLEEP_PROC_FS=y # CONFIG_ACPI_SLEEP_PROC_SLEEP is not set CONFIG_ACPI_AC=y CONFIG_ACPI_BATTERY=y CONFIG_ACPI_BUTTON=y CONFIG_ACPI_VIDEO=y CONFIG_ACPI_HOTKEY=y CONFIG_ACPI_FAN=y # CONFIG_ACPI_DOCK is not set CONFIG_ACPI_PROCESSOR=y CONFIG_ACPI_HOTPLUG_CPU=y CONFIG_ACPI_THERMAL=y # CONFIG_ACPI_ASUS is not set # CONFIG_ACPI_IBM is not set # CONFIG_ACPI_TOSHIBA is not set # CONFIG_ACPI_CUSTOM_DSDT is not set CONFIG_ACPI_BLACKLIST_YEAR=0 # CONFIG_ACPI_DEBUG is not set CONFIG_ACPI_EC=y CONFIG_ACPI_POWER=y CONFIG_ACPI_SYSTEM=y CONFIG_X86_PM_TIMER=y CONFIG_ACPI_CONTAINER=y # # APM (Advanced Power Management) BIOS Support # # CONFIG_APM is not set # # CPU Frequency scaling # CONFIG_CPU_FREQ=y CONFIG_CPU_FREQ_TABLE=y # CONFIG_CPU_FREQ_DEBUG is not set CONFIG_CPU_FREQ_STAT=y # CONFIG_CPU_FREQ_STAT_DETAILS is not set CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set CONFIG_CPU_FREQ_GOV_PERFORMANCE=y CONFIG_CPU_FREQ_GOV_POWERSAVE=y CONFIG_CPU_FREQ_GOV_USERSPACE=y CONFIG_CPU_FREQ_GOV_ONDEMAND=y CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y # # CPUFreq processor drivers # CONFIG_X86_ACPI_CPUFREQ=y # CONFIG_X86_POWERNOW_K6 is not set # CONFIG_X86_POWERNOW_K7 is not set # CONFIG_X86_POWERNOW_K8 is not set # CONFIG_X86_GX_SUSPMOD is not set CONFIG_X86_SPEEDSTEP_CENTRINO=y CONFIG_X86_SPEEDSTEP_CENTRINO_ACPI=y # CONFIG_X86_SPEEDSTEP_CENTRINO_TABLE is not set 
CONFIG_X86_SPEEDSTEP_ICH=y # CONFIG_X86_SPEEDSTEP_SMI is not set # CONFIG_X86_P4_CLOCKMOD is not set # CONFIG_X86_CPUFREQ_NFORCE2 is not set # CONFIG_X86_LONGRUN is not set # CONFIG_X86_LONGHAUL is not set # # shared options # # CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set CONFIG_X86_SPEEDSTEP_LIB=y # CONFIG_X86_SPEEDSTEP_RELAXED_CAP_CHECK is not set # # Bus options (PCI, PCMCIA, EISA, MCA, ISA) # CONFIG_PCI=y # CONFIG_PCI_GOBIOS is not set # CONFIG_PCI_GOMMCONFIG is not set # CONFIG_PCI_GODIRECT is not set CONFIG_PCI_GOANY=y CONFIG_PCI_BIOS=y CONFIG_PCI_DIRECT=y CONFIG_PCI_MMCONFIG=y # CONFIG_PCIEPORTBUS is not set CONFIG_PCI_MSI=y # CONFIG_PCI_MULTITHREAD_PROBE is not set # CONFIG_PCI_DEBUG is not set # CONFIG_HT_IRQ is not set CONFIG_ISA_DMA_API=y # CONFIG_ISA is not set # CONFIG_MCA is not set # CONFIG_SCx200 is not set # # PCCARD (PCMCIA/CardBus) support # # CONFIG_PCCARD is not set # # PCI Hotplug Support # # CONFIG_HOTPLUG_PCI is not set # # Executable file formats # CONFIG_BINFMT_ELF=y CONFIG_BINFMT_AOUT=y CONFIG_BINFMT_MISC=y # # Networking # CONFIG_NET=y # # Networking options # # CONFIG_NETDEBUG is not set CONFIG_PACKET=y CONFIG_PACKET_MMAP=y CONFIG_UNIX=y # CONFIG_NET_KEY is not set CONFIG_INET=y # CONFIG_IP_MULTICAST is not set # CONFIG_IP_ADVANCED_ROUTER is not set CONFIG_IP_FIB_HASH=y # CONFIG_IP_PNP is not set # CONFIG_NET_IPIP is not set # CONFIG_NET_IPGRE is not set # CONFIG_ARPD is not set # CONFIG_SYN_COOKIES is not set # CONFIG_INET_AH is not set # CONFIG_INET_ESP is not set # CONFIG_INET_IPCOMP is not set # CONFIG_INET_XFRM_TUNNEL is not set # CONFIG_INET_TUNNEL is not set # CONFIG_INET_XFRM_MODE_TRANSPORT is not set # CONFIG_INET_XFRM_MODE_TUNNEL is not set # CONFIG_INET_XFRM_MODE_BEET is not set # CONFIG_INET_DIAG is not set # CONFIG_TCP_CONG_ADVANCED is not set CONFIG_TCP_CONG_CUBIC=y CONFIG_DEFAULT_TCP_CONG="cubic" # CONFIG_TCP_MD5SIG is not set # CONFIG_IPV6 is not set # CONFIG_INET6_XFRM_TUNNEL is not set # CONFIG_INET6_TUNNEL is not set 
# CONFIG_NETWORK_SECMARK is not set # CONFIG_NETFILTER is not set # # DCCP Configuration (EXPERIMENTAL) # # CONFIG_IP_DCCP is not set # # SCTP Configuration (EXPERIMENTAL) # # CONFIG_IP_SCTP is not set # # TIPC Configuration (EXPERIMENTAL) # # CONFIG_TIPC is not set # CONFIG_ATM is not set # CONFIG_BRIDGE is not set # CONFIG_VLAN_8021Q is not set # CONFIG_DECNET is not set # CONFIG_LLC2 is not set # CONFIG_IPX is not set # CONFIG_ATALK is not set # CONFIG_X25 is not set # CONFIG_LAPB is not set # CONFIG_ECONET is not set # CONFIG_WAN_ROUTER is not set # # QoS and/or fair queueing # # CONFIG_NET_SCHED is not set # # Network testing # # CONFIG_NET_PKTGEN is not set # CONFIG_HAMRADIO is not set # CONFIG_IRDA is not set CONFIG_BT=y CONFIG_BT_L2CAP=y CONFIG_BT_SCO=y CONFIG_BT_RFCOMM=y CONFIG_BT_RFCOMM_TTY=y CONFIG_BT_BNEP=y # CONFIG_BT_BNEP_MC_FILTER is not set # CONFIG_BT_BNEP_PROTO_FILTER is not set CONFIG_BT_HIDP=y # # Bluetooth device drivers # CONFIG_BT_HCIUSB=y # CONFIG_BT_HCIUSB_SCO is not set # CONFIG_BT_HCIUART is not set # CONFIG_BT_HCIBCM203X is not set # CONFIG_BT_HCIBPA10X is not set # CONFIG_BT_HCIBFUSB is not set # CONFIG_BT_HCIVHCI is not set # CONFIG_IEEE80211 is not set CONFIG_WIRELESS_EXT=y # # Device Drivers # # # Generic Driver Options # # CONFIG_STANDALONE is not set # CONFIG_PREVENT_FIRMWARE_BUILD is not set CONFIG_FW_LOADER=y # CONFIG_DEBUG_DRIVER is not set # CONFIG_SYS_HYPERVISOR is not set # # Connector - unified userspace <-> kernelspace linker # # CONFIG_CONNECTOR is not set # # Memory Technology Devices (MTD) # # CONFIG_MTD is not set # # Parallel port support # # CONFIG_PARPORT is not set # # Plug and Play support # CONFIG_PNP=y # CONFIG_PNP_DEBUG is not set # # Protocols # CONFIG_PNPACPI=y # # Block devices # CONFIG_BLK_DEV_FD=y # CONFIG_BLK_CPQ_DA is not set # CONFIG_BLK_CPQ_CISS_DA is not set # CONFIG_BLK_DEV_DAC960 is not set # CONFIG_BLK_DEV_UMEM is not set # CONFIG_BLK_DEV_COW_COMMON is not set CONFIG_BLK_DEV_LOOP=y # 
CONFIG_BLK_DEV_CRYPTOLOOP is not set # CONFIG_BLK_DEV_NBD is not set # CONFIG_BLK_DEV_SX8 is not set # CONFIG_BLK_DEV_UB is not set # CONFIG_BLK_DEV_RAM is not set # CONFIG_BLK_DEV_INITRD is not set # CONFIG_CDROM_PKTCDVD is not set # CONFIG_ATA_OVER_ETH is not set # # Misc devices # # CONFIG_IBM_ASM is not set # CONFIG_SGI_IOC4 is not set # CONFIG_TIFM_CORE is not set # CONFIG_MSI_LAPTOP is not set # # ATA/ATAPI/MFM/RLL support # CONFIG_IDE=y CONFIG_BLK_DEV_IDE=y # # Please see Documentation/ide.txt for help/info on IDE drives # # CONFIG_BLK_DEV_IDE_SATA is not set # CONFIG_BLK_DEV_HD_IDE is not set CONFIG_BLK_DEV_IDEDISK=y CONFIG_IDEDISK_MULTI_MODE=y CONFIG_BLK_DEV_IDECD=y # CONFIG_BLK_DEV_IDETAPE is not set # CONFIG_BLK_DEV_IDEFLOPPY is not set CONFIG_BLK_DEV_IDESCSI=y # CONFIG_IDE_TASK_IOCTL is not set # # IDE chipset support/bugfixes # CONFIG_IDE_GENERIC=y # CONFIG_BLK_DEV_CMD640 is not set # CONFIG_BLK_DEV_IDEPNP is not set CONFIG_BLK_DEV_IDEPCI=y CONFIG_IDEPCI_SHARE_IRQ=y # CONFIG_BLK_DEV_OFFBOARD is not set CONFIG_BLK_DEV_GENERIC=y # CONFIG_BLK_DEV_OPTI621 is not set # CONFIG_BLK_DEV_RZ1000 is not set CONFIG_BLK_DEV_IDEDMA_PCI=y # CONFIG_BLK_DEV_IDEDMA_FORCED is not set CONFIG_IDEDMA_PCI_AUTO=y # CONFIG_IDEDMA_ONLYDISK is not set # CONFIG_BLK_DEV_AEC62XX is not set # CONFIG_BLK_DEV_ALI15X3 is not set # CONFIG_BLK_DEV_AMD74XX is not set # CONFIG_BLK_DEV_ATIIXP is not set # CONFIG_BLK_DEV_CMD64X is not set # CONFIG_BLK_DEV_TRIFLEX is not set # CONFIG_BLK_DEV_CY82C693 is not set # CONFIG_BLK_DEV_CS5520 is not set # CONFIG_BLK_DEV_CS5530 is not set # CONFIG_BLK_DEV_CS5535 is not set # CONFIG_BLK_DEV_HPT34X is not set # CONFIG_BLK_DEV_HPT366 is not set # CONFIG_BLK_DEV_JMICRON is not set # CONFIG_BLK_DEV_SC1200 is not set CONFIG_BLK_DEV_PIIX=y # CONFIG_BLK_DEV_IT821X is not set # CONFIG_BLK_DEV_NS87415 is not set # CONFIG_BLK_DEV_PDC202XX_OLD is not set # CONFIG_BLK_DEV_PDC202XX_NEW is not set # CONFIG_BLK_DEV_SVWKS is not set # CONFIG_BLK_DEV_SIIMAGE is not set 
# CONFIG_BLK_DEV_SIS5513 is not set # CONFIG_BLK_DEV_SLC90E66 is not set # CONFIG_BLK_DEV_TRM290 is not set # CONFIG_BLK_DEV_VIA82CXXX is not set # CONFIG_IDE_ARM is not set CONFIG_BLK_DEV_IDEDMA=y # CONFIG_IDEDMA_IVB is not set CONFIG_IDEDMA_AUTO=y # CONFIG_BLK_DEV_HD is not set # # SCSI device support # # CONFIG_RAID_ATTRS is not set CONFIG_SCSI=y # CONFIG_SCSI_TGT is not set # CONFIG_SCSI_NETLINK is not set CONFIG_SCSI_PROC_FS=y # # SCSI support type (disk, tape, CD-ROM) # CONFIG_BLK_DEV_SD=y # CONFIG_CHR_DEV_ST is not set # CONFIG_CHR_DEV_OSST is not set CONFIG_BLK_DEV_SR=y # CONFIG_BLK_DEV_SR_VENDOR is not set CONFIG_CHR_DEV_SG=y # CONFIG_CHR_DEV_SCH is not set # # Some SCSI devices (e.g. CD jukebox) support multiple LUNs # CONFIG_SCSI_MULTI_LUN=y # CONFIG_SCSI_CONSTANTS is not set # CONFIG_SCSI_LOGGING is not set # CONFIG_SCSI_SCAN_ASYNC is not set # # SCSI Transports # # CONFIG_SCSI_SPI_ATTRS is not set # CONFIG_SCSI_FC_ATTRS is not set # CONFIG_SCSI_ISCSI_ATTRS is not set # CONFIG_SCSI_SAS_ATTRS is not set # CONFIG_SCSI_SAS_LIBSAS is not set # # SCSI low-level drivers # # CONFIG_ISCSI_TCP is not set # CONFIG_BLK_DEV_3W_XXXX_RAID is not set # CONFIG_SCSI_3W_9XXX is not set # CONFIG_SCSI_ACARD is not set # CONFIG_SCSI_AACRAID is not set # CONFIG_SCSI_AIC7XXX is not set # CONFIG_SCSI_AIC7XXX_OLD is not set # CONFIG_SCSI_AIC79XX is not set # CONFIG_SCSI_AIC94XX is not set # CONFIG_SCSI_DPT_I2O is not set # CONFIG_SCSI_ADVANSYS is not set # CONFIG_SCSI_ARCMSR is not set # CONFIG_MEGARAID_NEWGEN is not set # CONFIG_MEGARAID_LEGACY is not set # CONFIG_MEGARAID_SAS is not set # CONFIG_SCSI_HPTIOP is not set # CONFIG_SCSI_BUSLOGIC is not set # CONFIG_SCSI_DMX3191D is not set # CONFIG_SCSI_EATA is not set # CONFIG_SCSI_FUTURE_DOMAIN is not set # CONFIG_SCSI_GDTH is not set # CONFIG_SCSI_IPS is not set # CONFIG_SCSI_INITIO is not set # CONFIG_SCSI_INIA100 is not set # CONFIG_SCSI_STEX is not set # CONFIG_SCSI_SYM53C8XX_2 is not set # CONFIG_SCSI_IPR is not set # 
CONFIG_SCSI_QLOGIC_1280 is not set # CONFIG_SCSI_QLA_FC is not set # CONFIG_SCSI_QLA_ISCSI is not set # CONFIG_SCSI_LPFC is not set # CONFIG_SCSI_DC395x is not set # CONFIG_SCSI_DC390T is not set # CONFIG_SCSI_NSP32 is not set # CONFIG_SCSI_DEBUG is not set # CONFIG_SCSI_SRP is not set # # Serial ATA (prod) and Parallel ATA (experimental) drivers # CONFIG_ATA=y CONFIG_SATA_AHCI=y # CONFIG_SATA_SVW is not set CONFIG_ATA_PIIX=y # CONFIG_SATA_MV is not set # CONFIG_SATA_NV is not set # CONFIG_PDC_ADMA is not set # CONFIG_SATA_QSTOR is not set # CONFIG_SATA_PROMISE is not set # CONFIG_SATA_SX4 is not set # CONFIG_SATA_SIL is not set # CONFIG_SATA_SIL24 is not set # CONFIG_SATA_SIS is not set # CONFIG_SATA_ULI is not set # CONFIG_SATA_VIA is not set # CONFIG_SATA_VITESSE is not set CONFIG_SATA_INTEL_COMBINED=y # CONFIG_PATA_ALI is not set # CONFIG_PATA_AMD is not set # CONFIG_PATA_ARTOP is not set # CONFIG_PATA_ATIIXP is not set # CONFIG_PATA_CMD64X is not set # CONFIG_PATA_CS5520 is not set # CONFIG_PATA_CS5530 is not set # CONFIG_PATA_CS5535 is not set # CONFIG_PATA_CYPRESS is not set # CONFIG_PATA_EFAR is not set # CONFIG_ATA_GENERIC is not set # CONFIG_PATA_HPT366 is not set # CONFIG_PATA_HPT37X is not set # CONFIG_PATA_HPT3X2N is not set # CONFIG_PATA_HPT3X3 is not set # CONFIG_PATA_IT821X is not set # CONFIG_PATA_JMICRON is not set # CONFIG_PATA_TRIFLEX is not set # CONFIG_PATA_MARVELL is not set # CONFIG_PATA_MPIIX is not set # CONFIG_PATA_OLDPIIX is not set # CONFIG_PATA_NETCELL is not set # CONFIG_PATA_NS87410 is not set # CONFIG_PATA_OPTI is not set # CONFIG_PATA_OPTIDMA is not set # CONFIG_PATA_PDC_OLD is not set # CONFIG_PATA_RADISYS is not set # CONFIG_PATA_RZ1000 is not set # CONFIG_PATA_SC1200 is not set # CONFIG_PATA_SERVERWORKS is not set # CONFIG_PATA_PDC2027X is not set # CONFIG_PATA_SIL680 is not set # CONFIG_PATA_SIS is not set # CONFIG_PATA_VIA is not set # CONFIG_PATA_WINBOND is not set # # Multi-device support (RAID and LVM) # # CONFIG_MD is not 
set # # Fusion MPT device support # # CONFIG_FUSION is not set # CONFIG_FUSION_SPI is not set # CONFIG_FUSION_FC is not set # CONFIG_FUSION_SAS is not set # # IEEE 1394 (FireWire) support # CONFIG_IEEE1394=y # # Subsystem Options # # CONFIG_IEEE1394_VERBOSEDEBUG is not set # CONFIG_IEEE1394_OUI_DB is not set # CONFIG_IEEE1394_EXTRA_CONFIG_ROMS is not set # CONFIG_IEEE1394_EXPORT_FULL_API is not set # # Device Drivers # # # Texas Instruments PCILynx requires I2C # CONFIG_IEEE1394_OHCI1394=y # # Protocol Drivers # # CONFIG_IEEE1394_VIDEO1394 is not set CONFIG_IEEE1394_SBP2=y # CONFIG_IEEE1394_SBP2_PHYS_DMA is not set # CONFIG_IEEE1394_ETH1394 is not set # CONFIG_IEEE1394_DV1394 is not set CONFIG_IEEE1394_RAWIO=y # # I2O device support # # CONFIG_I2O is not set # # Network device support # CONFIG_NETDEVICES=y # CONFIG_DUMMY is not set # CONFIG_BONDING is not set # CONFIG_EQUALIZER is not set # CONFIG_TUN is not set # CONFIG_NET_SB1000 is not set # # ARCnet devices # # CONFIG_ARCNET is not set # # PHY device support # # CONFIG_PHYLIB is not set # # Ethernet (10 or 100Mbit) # CONFIG_NET_ETHERNET=y CONFIG_MII=y # CONFIG_HAPPYMEAL is not set # CONFIG_SUNGEM is not set # CONFIG_CASSINI is not set # CONFIG_NET_VENDOR_3COM is not set # # Tulip family network device support # # CONFIG_NET_TULIP is not set # CONFIG_HP100 is not set CONFIG_NET_PCI=y # CONFIG_PCNET32 is not set # CONFIG_AMD8111_ETH is not set # CONFIG_ADAPTEC_STARFIRE is not set # CONFIG_B44 is not set # CONFIG_FORCEDETH is not set # CONFIG_DGRS is not set # CONFIG_EEPRO100 is not set CONFIG_E100=y # CONFIG_FEALNX is not set # CONFIG_NATSEMI is not set # CONFIG_NE2K_PCI is not set # CONFIG_8139CP is not set # CONFIG_8139TOO is not set # CONFIG_SIS900 is not set # CONFIG_EPIC100 is not set # CONFIG_SUNDANCE is not set # CONFIG_TLAN is not set # CONFIG_VIA_RHINE is not set # # Ethernet (1000 Mbit) # # CONFIG_ACENIC is not set # CONFIG_DL2K is not set # CONFIG_E1000 is not set # CONFIG_NS83820 is not set # 
CONFIG_HAMACHI is not set # CONFIG_YELLOWFIN is not set # CONFIG_R8169 is not set # CONFIG_SIS190 is not set # CONFIG_SKGE is not set # CONFIG_SKY2 is not set # CONFIG_SK98LIN is not set # CONFIG_VIA_VELOCITY is not set # CONFIG_TIGON3 is not set # CONFIG_BNX2 is not set # CONFIG_QLA3XXX is not set # # Ethernet (10000 Mbit) # # CONFIG_CHELSIO_T1 is not set # CONFIG_IXGB is not set # CONFIG_S2IO is not set # CONFIG_MYRI10GE is not set # CONFIG_NETXEN_NIC is not set # # Token Ring devices # # CONFIG_TR is not set # # Wireless LAN (non-hamradio) # CONFIG_NET_RADIO=y # CONFIG_NET_WIRELESS_RTNETLINK is not set # # Obsolete Wireless cards support (pre-802.11) # # CONFIG_STRIP is not set # # Wireless 802.11b ISA/PCI cards support # # CONFIG_IPW2100 is not set # CONFIG_IPW2200 is not set # CONFIG_AIRO is not set # CONFIG_HERMES is not set # CONFIG_ATMEL is not set # # Prism GT/Duette 802.11(a/b/g) PCI/Cardbus support # # CONFIG_PRISM54 is not set # CONFIG_USB_ZD1201 is not set # CONFIG_HOSTAP is not set CONFIG_NET_WIRELESS=y # # Wan interfaces # # CONFIG_WAN is not set # CONFIG_FDDI is not set # CONFIG_HIPPI is not set # CONFIG_PPP is not set # CONFIG_SLIP is not set # CONFIG_NET_FC is not set # CONFIG_SHAPER is not set # CONFIG_NETCONSOLE is not set # CONFIG_NETPOLL is not set # CONFIG_NET_POLL_CONTROLLER is not set # # ISDN subsystem # # CONFIG_ISDN is not set # # Telephony Support # # CONFIG_PHONE is not set # # Input device support # CONFIG_INPUT=y # CONFIG_INPUT_FF_MEMLESS is not set # # Userland interfaces # CONFIG_INPUT_MOUSEDEV=y CONFIG_INPUT_MOUSEDEV_PSAUX=y CONFIG_INPUT_MOUSEDEV_SCREEN_X=1280 CONFIG_INPUT_MOUSEDEV_SCREEN_Y=800 # CONFIG_INPUT_JOYDEV is not set # CONFIG_INPUT_TSDEV is not set # CONFIG_INPUT_EVDEV is not set # CONFIG_INPUT_EVBUG is not set # # Input Device Drivers # CONFIG_INPUT_KEYBOARD=y CONFIG_KEYBOARD_ATKBD=y # CONFIG_KEYBOARD_SUNKBD is not set # CONFIG_KEYBOARD_LKKBD is not set # CONFIG_KEYBOARD_XTKBD is not set # CONFIG_KEYBOARD_NEWTON is not 
set # CONFIG_KEYBOARD_STOWAWAY is not set CONFIG_INPUT_MOUSE=y CONFIG_MOUSE_PS2=y # CONFIG_MOUSE_SERIAL is not set # CONFIG_MOUSE_VSXXXAA is not set # CONFIG_INPUT_JOYSTICK is not set # CONFIG_INPUT_TOUCHSCREEN is not set CONFIG_INPUT_MISC=y # CONFIG_INPUT_PCSPKR is not set CONFIG_INPUT_WISTRON_BTNS=y # CONFIG_INPUT_UINPUT is not set # # Hardware I/O ports # CONFIG_SERIO=y CONFIG_SERIO_I8042=y # CONFIG_SERIO_SERPORT is not set # CONFIG_SERIO_CT82C710 is not set # CONFIG_SERIO_PCIPS2 is not set CONFIG_SERIO_LIBPS2=y # CONFIG_SERIO_RAW is not set # CONFIG_GAMEPORT is not set # # Character devices # CONFIG_VT=y CONFIG_VT_CONSOLE=y CONFIG_HW_CONSOLE=y # CONFIG_VT_HW_CONSOLE_BINDING is not set # CONFIG_SERIAL_NONSTANDARD is not set # # Serial drivers # # CONFIG_SERIAL_8250 is not set # # Non-8250 serial port support # # CONFIG_SERIAL_JSM is not set CONFIG_UNIX98_PTYS=y CONFIG_LEGACY_PTYS=y CONFIG_LEGACY_PTY_COUNT=256 # # IPMI # # CONFIG_IPMI_HANDLER is not set # # Watchdog Cards # # CONFIG_WATCHDOG is not set CONFIG_HW_RANDOM=y CONFIG_HW_RANDOM_INTEL=y # CONFIG_HW_RANDOM_AMD is not set # CONFIG_HW_RANDOM_GEODE is not set # CONFIG_HW_RANDOM_VIA is not set CONFIG_NVRAM=y CONFIG_RTC=y # CONFIG_DTLK is not set # CONFIG_R3964 is not set # CONFIG_APPLICOM is not set # CONFIG_SONYPI is not set CONFIG_AGP=y # CONFIG_AGP_ALI is not set # CONFIG_AGP_ATI is not set # CONFIG_AGP_AMD is not set # CONFIG_AGP_AMD64 is not set CONFIG_AGP_INTEL=y # CONFIG_AGP_NVIDIA is not set # CONFIG_AGP_SIS is not set # CONFIG_AGP_SWORKS is not set # CONFIG_AGP_VIA is not set # CONFIG_AGP_EFFICEON is not set CONFIG_DRM=y # CONFIG_DRM_TDFX is not set # CONFIG_DRM_R128 is not set # CONFIG_DRM_RADEON is not set # CONFIG_DRM_I810 is not set # CONFIG_DRM_I830 is not set CONFIG_DRM_I915=y # CONFIG_DRM_MGA is not set # CONFIG_DRM_SIS is not set # CONFIG_DRM_VIA is not set # CONFIG_DRM_SAVAGE is not set # CONFIG_MWAVE is not set # CONFIG_PC8736x_GPIO is not set # CONFIG_NSC_GPIO is not set # 
CONFIG_CS5535_GPIO is not set # CONFIG_RAW_DRIVER is not set # CONFIG_HPET is not set # CONFIG_HANGCHECK_TIMER is not set # # TPM devices # # CONFIG_TCG_TPM is not set # CONFIG_TELCLOCK is not set # # I2C support # # CONFIG_I2C is not set # # SPI support # # CONFIG_SPI is not set # CONFIG_SPI_MASTER is not set # # Dallas's 1-wire bus # # CONFIG_W1 is not set # # Hardware Monitoring support # # CONFIG_HWMON is not set # CONFIG_HWMON_VID is not set # # Multimedia devices # # CONFIG_VIDEO_DEV is not set # # Digital Video Broadcasting Devices # # CONFIG_DVB is not set # CONFIG_USB_DABUSB is not set # # Graphics support # # CONFIG_FIRMWARE_EDID is not set CONFIG_FB=y CONFIG_FB_CFB_FILLRECT=y CONFIG_FB_CFB_COPYAREA=y CONFIG_FB_CFB_IMAGEBLIT=y # CONFIG_FB_MACMODES is not set # CONFIG_FB_BACKLIGHT is not set CONFIG_FB_MODE_HELPERS=y # CONFIG_FB_TILEBLITTING is not set # CONFIG_FB_CIRRUS is not set # CONFIG_FB_PM2 is not set # CONFIG_FB_CYBER2000 is not set # CONFIG_FB_ARC is not set # CONFIG_FB_ASILIANT is not set # CONFIG_FB_IMSTT is not set # CONFIG_FB_VGA16 is not set CONFIG_FB_VESA=y # CONFIG_FB_HGA is not set # CONFIG_FB_S1D13XXX is not set # CONFIG_FB_NVIDIA is not set # CONFIG_FB_RIVA is not set CONFIG_FB_I810=y CONFIG_FB_I810_GTF=y # CONFIG_FB_I810_I2C is not set CONFIG_FB_INTEL=y # CONFIG_FB_INTEL_DEBUG is not set # CONFIG_FB_INTEL_I2C is not set # CONFIG_FB_MATROX is not set # CONFIG_FB_RADEON is not set # CONFIG_FB_ATY128 is not set # CONFIG_FB_ATY is not set # CONFIG_FB_SAVAGE is not set # CONFIG_FB_SIS is not set # CONFIG_FB_NEOMAGIC is not set # CONFIG_FB_KYRO is not set # CONFIG_FB_3DFX is not set # CONFIG_FB_VOODOO1 is not set # CONFIG_FB_CYBLA is not set # CONFIG_FB_TRIDENT is not set # CONFIG_FB_GEODE is not set # CONFIG_FB_VIRTUAL is not set # # Console display driver support # CONFIG_VGA_CONSOLE=y # CONFIG_VGACON_SOFT_SCROLLBACK is not set CONFIG_VIDEO_SELECT=y CONFIG_DUMMY_CONSOLE=y CONFIG_FRAMEBUFFER_CONSOLE=y # CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is 
not set # CONFIG_FONTS is not set CONFIG_FONT_8x8=y CONFIG_FONT_8x16=y # # Logo configuration # # CONFIG_LOGO is not set CONFIG_BACKLIGHT_LCD_SUPPORT=y CONFIG_BACKLIGHT_CLASS_DEVICE=y CONFIG_BACKLIGHT_DEVICE=y CONFIG_LCD_CLASS_DEVICE=y CONFIG_LCD_DEVICE=y # # Sound # CONFIG_SOUND=y # # Advanced Linux Sound Architecture # CONFIG_SND=y CONFIG_SND_TIMER=y CONFIG_SND_PCM=y CONFIG_SND_SEQUENCER=y # CONFIG_SND_SEQ_DUMMY is not set # CONFIG_SND_MIXER_OSS is not set # CONFIG_SND_PCM_OSS is not set # CONFIG_SND_SEQUENCER_OSS is not set CONFIG_SND_RTCTIMER=y CONFIG_SND_SEQ_RTCTIMER_DEFAULT=y # CONFIG_SND_DYNAMIC_MINORS is not set CONFIG_SND_SUPPORT_OLD_API=y CONFIG_SND_VERBOSE_PROCFS=y # CONFIG_SND_VERBOSE_PRINTK is not set # CONFIG_SND_DEBUG is not set # # Generic devices # CONFIG_SND_AC97_CODEC=y # CONFIG_SND_DUMMY is not set # CONFIG_SND_VIRMIDI is not set # CONFIG_SND_MTPAV is not set # CONFIG_SND_SERIAL_U16550 is not set # CONFIG_SND_MPU401 is not set # # PCI devices # # CONFIG_SND_AD1889 is not set # CONFIG_SND_ALS300 is not set # CONFIG_SND_ALS4000 is not set # CONFIG_SND_ALI5451 is not set # CONFIG_SND_ATIIXP is not set # CONFIG_SND_ATIIXP_MODEM is not set # CONFIG_SND_AU8810 is not set # CONFIG_SND_AU8820 is not set # CONFIG_SND_AU8830 is not set # CONFIG_SND_AZT3328 is not set # CONFIG_SND_BT87X is not set # CONFIG_SND_CA0106 is not set # CONFIG_SND_CMIPCI is not set # CONFIG_SND_CS4281 is not set # CONFIG_SND_CS46XX is not set # CONFIG_SND_CS5535AUDIO is not set # CONFIG_SND_DARLA20 is not set # CONFIG_SND_GINA20 is not set # CONFIG_SND_LAYLA20 is not set # CONFIG_SND_DARLA24 is not set # CONFIG_SND_GINA24 is not set # CONFIG_SND_LAYLA24 is not set # CONFIG_SND_MONA is not set # CONFIG_SND_MIA is not set # CONFIG_SND_ECHO3G is not set # CONFIG_SND_INDIGO is not set # CONFIG_SND_INDIGOIO is not set # CONFIG_SND_INDIGODJ is not set # CONFIG_SND_EMU10K1 is not set # CONFIG_SND_EMU10K1X is not set # CONFIG_SND_ENS1370 is not set # CONFIG_SND_ENS1371 is not set # 
CONFIG_SND_ES1938 is not set # CONFIG_SND_ES1968 is not set # CONFIG_SND_FM801 is not set CONFIG_SND_HDA_INTEL=y # CONFIG_SND_HDSP is not set # CONFIG_SND_HDSPM is not set # CONFIG_SND_ICE1712 is not set # CONFIG_SND_ICE1724 is not set CONFIG_SND_INTEL8X0=y CONFIG_SND_INTEL8X0M=y # CONFIG_SND_KORG1212 is not set # CONFIG_SND_MAESTRO3 is not set # CONFIG_SND_MIXART is not set # CONFIG_SND_NM256 is not set # CONFIG_SND_PCXHR is not set # CONFIG_SND_RIPTIDE is not set # CONFIG_SND_RME32 is not set # CONFIG_SND_RME96 is not set # CONFIG_SND_RME9652 is not set # CONFIG_SND_SONICVIBES is not set # CONFIG_SND_TRIDENT is not set # CONFIG_SND_VIA82XX is not set # CONFIG_SND_VIA82XX_MODEM is not set # CONFIG_SND_VX222 is not set # CONFIG_SND_YMFPCI is not set # CONFIG_SND_AC97_POWER_SAVE is not set # # USB devices # # CONFIG_SND_USB_AUDIO is not set # CONFIG_SND_USB_USX2Y is not set # # Open Sound System # # CONFIG_SOUND_PRIME is not set CONFIG_AC97_BUS=y # # HID Devices # # CONFIG_HID is not set # # USB support # CONFIG_USB_ARCH_HAS_HCD=y CONFIG_USB_ARCH_HAS_OHCI=y CONFIG_USB_ARCH_HAS_EHCI=y CONFIG_USB=y # CONFIG_USB_DEBUG is not set # # Miscellaneous USB options # # CONFIG_USB_DEVICEFS is not set # CONFIG_USB_BANDWIDTH is not set # CONFIG_USB_DYNAMIC_MINORS is not set # CONFIG_USB_SUSPEND is not set # CONFIG_USB_MULTITHREAD_PROBE is not set # CONFIG_USB_OTG is not set # # USB Host Controller Drivers # CONFIG_USB_EHCI_HCD=y # CONFIG_USB_EHCI_SPLIT_ISO is not set # CONFIG_USB_EHCI_ROOT_HUB_TT is not set # CONFIG_USB_EHCI_TT_NEWSCHED is not set # CONFIG_USB_ISP116X_HCD is not set # CONFIG_USB_OHCI_HCD is not set CONFIG_USB_UHCI_HCD=y # CONFIG_USB_SL811_HCD is not set # # USB Device Class drivers # # CONFIG_USB_ACM is not set # CONFIG_USB_PRINTER is not set # # NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support' # # # may also be needed; see USB_STORAGE Help for more information # CONFIG_USB_STORAGE=y # CONFIG_USB_STORAGE_DEBUG is not set # CONFIG_USB_STORAGE_DATAFAB is 
not set # CONFIG_USB_STORAGE_FREECOM is not set # CONFIG_USB_STORAGE_ISD200 is not set # CONFIG_USB_STORAGE_DPCM is not set # CONFIG_USB_STORAGE_USBAT is not set # CONFIG_USB_STORAGE_SDDR09 is not set # CONFIG_USB_STORAGE_SDDR55 is not set # CONFIG_USB_STORAGE_JUMPSHOT is not set # CONFIG_USB_STORAGE_ALAUDA is not set # CONFIG_USB_STORAGE_KARMA is not set # CONFIG_USB_LIBUSUAL is not set # # USB Input Devices # # # USB HID Boot Protocol drivers # # CONFIG_USB_KBD is not set # CONFIG_USB_MOUSE is not set # CONFIG_USB_AIPTEK is not set # CONFIG_USB_WACOM is not set # CONFIG_USB_ACECAD is not set # CONFIG_USB_KBTAB is not set # CONFIG_USB_POWERMATE is not set # CONFIG_USB_TOUCHSCREEN is not set # CONFIG_USB_YEALINK is not set # CONFIG_USB_XPAD is not set # CONFIG_USB_ATI_REMOTE is not set # CONFIG_USB_ATI_REMOTE2 is not set # CONFIG_USB_KEYSPAN_REMOTE is not set # CONFIG_USB_APPLETOUCH is not set # # USB Imaging devices # # CONFIG_USB_MDC800 is not set # CONFIG_USB_MICROTEK is not set # # USB Network Adapters # # CONFIG_USB_CATC is not set # CONFIG_USB_KAWETH is not set # CONFIG_USB_PEGASUS is not set # CONFIG_USB_RTL8150 is not set # CONFIG_USB_USBNET_MII is not set # CONFIG_USB_USBNET is not set # CONFIG_USB_MON is not set # # USB port drivers # # # USB Serial Converter support # # CONFIG_USB_SERIAL is not set # # USB Miscellaneous drivers # # CONFIG_USB_EMI62 is not set # CONFIG_USB_EMI26 is not set # CONFIG_USB_ADUTUX is not set # CONFIG_USB_AUERSWALD is not set # CONFIG_USB_RIO500 is not set # CONFIG_USB_LEGOTOWER is not set # CONFIG_USB_LCD is not set # CONFIG_USB_LED is not set # CONFIG_USB_CYPRESS_CY7C63 is not set # CONFIG_USB_CYTHERM is not set # CONFIG_USB_PHIDGET is not set # CONFIG_USB_IDMOUSE is not set # CONFIG_USB_FTDI_ELAN is not set # CONFIG_USB_APPLEDISPLAY is not set # CONFIG_USB_SISUSBVGA is not set # CONFIG_USB_LD is not set # CONFIG_USB_TRANCEVIBRATOR is not set # # USB DSL modem support # # # USB Gadget Support # # CONFIG_USB_GADGET is not set 
# # MMC/SD Card support # # CONFIG_MMC is not set # # LED devices # # CONFIG_NEW_LEDS is not set # # LED drivers # # # LED Triggers # # # InfiniBand support # # CONFIG_INFINIBAND is not set # # EDAC - error detection and reporting (RAS) (EXPERIMENTAL) # # CONFIG_EDAC is not set # # Real Time Clock # # CONFIG_RTC_CLASS is not set # # DMA Engine support # # CONFIG_DMA_ENGINE is not set # # DMA Clients # # # DMA Devices # # # Virtualization # # CONFIG_KVM is not set # # File systems # CONFIG_EXT2_FS=y # CONFIG_EXT2_FS_XATTR is not set # CONFIG_EXT2_FS_XIP is not set CONFIG_EXT3_FS=y CONFIG_EXT3_FS_XATTR=y # CONFIG_EXT3_FS_POSIX_ACL is not set # CONFIG_EXT3_FS_SECURITY is not set # CONFIG_EXT4DEV_FS is not set CONFIG_JBD=y # CONFIG_JBD_DEBUG is not set CONFIG_FS_MBCACHE=y # CONFIG_REISERFS_FS is not set # CONFIG_JFS_FS is not set # CONFIG_FS_POSIX_ACL is not set # CONFIG_XFS_FS is not set # CONFIG_GFS2_FS is not set # CONFIG_OCFS2_FS is not set # CONFIG_MINIX_FS is not set # CONFIG_ROMFS_FS is not set # CONFIG_INOTIFY is not set # CONFIG_QUOTA is not set CONFIG_DNOTIFY=y # CONFIG_AUTOFS_FS is not set CONFIG_AUTOFS4_FS=y # CONFIG_FUSE_FS is not set # # CD-ROM/DVD Filesystems # CONFIG_ISO9660_FS=y CONFIG_JOLIET=y CONFIG_ZISOFS=y CONFIG_ZISOFS_FS=y CONFIG_UDF_FS=y CONFIG_UDF_NLS=y # # DOS/FAT/NT Filesystems # CONFIG_FAT_FS=y CONFIG_MSDOS_FS=y CONFIG_VFAT_FS=y CONFIG_FAT_DEFAULT_CODEPAGE=437 CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1" CONFIG_NTFS_FS=y # CONFIG_NTFS_DEBUG is not set # CONFIG_NTFS_RW is not set # # Pseudo filesystems # CONFIG_PROC_FS=y CONFIG_PROC_KCORE=y CONFIG_PROC_SYSCTL=y CONFIG_SYSFS=y CONFIG_TMPFS=y # CONFIG_TMPFS_POSIX_ACL is not set # CONFIG_HUGETLBFS is not set # CONFIG_HUGETLB_PAGE is not set CONFIG_RAMFS=y # CONFIG_CONFIGFS_FS is not set # # Miscellaneous filesystems # # CONFIG_ADFS_FS is not set # CONFIG_AFFS_FS is not set # CONFIG_HFS_FS is not set # CONFIG_HFSPLUS_FS is not set # CONFIG_BEFS_FS is not set # CONFIG_BFS_FS is not set # CONFIG_EFS_FS 
is not set # CONFIG_CRAMFS is not set # CONFIG_VXFS_FS is not set # CONFIG_HPFS_FS is not set # CONFIG_QNX4FS_FS is not set # CONFIG_SYSV_FS is not set CONFIG_UFS_FS=y # CONFIG_UFS_FS_WRITE is not set # CONFIG_UFS_DEBUG is not set # # Network File Systems # # CONFIG_NFS_FS is not set # CONFIG_NFSD is not set # CONFIG_SMB_FS is not set CONFIG_CIFS=y # CONFIG_CIFS_STATS is not set # CONFIG_CIFS_WEAK_PW_HASH is not set # CONFIG_CIFS_XATTR is not set # CONFIG_CIFS_DEBUG2 is not set # CONFIG_CIFS_EXPERIMENTAL is not set # CONFIG_NCP_FS is not set # CONFIG_CODA_FS is not set # CONFIG_AFS_FS is not set # CONFIG_9P_FS is not set # # Partition Types # CONFIG_PARTITION_ADVANCED=y # CONFIG_ACORN_PARTITION is not set # CONFIG_OSF_PARTITION is not set # CONFIG_AMIGA_PARTITION is not set # CONFIG_ATARI_PARTITION is not set # CONFIG_MAC_PARTITION is not set CONFIG_MSDOS_PARTITION=y CONFIG_BSD_DISKLABEL=y # CONFIG_MINIX_SUBPARTITION is not set # CONFIG_SOLARIS_X86_PARTITION is not set # CONFIG_UNIXWARE_DISKLABEL is not set # CONFIG_LDM_PARTITION is not set # CONFIG_SGI_PARTITION is not set # CONFIG_ULTRIX_PARTITION is not set # CONFIG_SUN_PARTITION is not set # CONFIG_KARMA_PARTITION is not set # CONFIG_EFI_PARTITION is not set # # Native Language Support # CONFIG_NLS=y CONFIG_NLS_DEFAULT="iso8859-1" CONFIG_NLS_CODEPAGE_437=y # CONFIG_NLS_CODEPAGE_737 is not set # CONFIG_NLS_CODEPAGE_775 is not set # CONFIG_NLS_CODEPAGE_850 is not set # CONFIG_NLS_CODEPAGE_852 is not set # CONFIG_NLS_CODEPAGE_855 is not set # CONFIG_NLS_CODEPAGE_857 is not set # CONFIG_NLS_CODEPAGE_860 is not set # CONFIG_NLS_CODEPAGE_861 is not set # CONFIG_NLS_CODEPAGE_862 is not set # CONFIG_NLS_CODEPAGE_863 is not set # CONFIG_NLS_CODEPAGE_864 is not set # CONFIG_NLS_CODEPAGE_865 is not set # CONFIG_NLS_CODEPAGE_866 is not set # CONFIG_NLS_CODEPAGE_869 is not set # CONFIG_NLS_CODEPAGE_936 is not set # CONFIG_NLS_CODEPAGE_950 is not set # CONFIG_NLS_CODEPAGE_932 is not set # CONFIG_NLS_CODEPAGE_949 is not set # 
CONFIG_NLS_CODEPAGE_874 is not set # CONFIG_NLS_ISO8859_8 is not set # CONFIG_NLS_CODEPAGE_1250 is not set # CONFIG_NLS_CODEPAGE_1251 is not set # CONFIG_NLS_ASCII is not set CONFIG_NLS_ISO8859_1=y # CONFIG_NLS_ISO8859_2 is not set # CONFIG_NLS_ISO8859_3 is not set # CONFIG_NLS_ISO8859_4 is not set # CONFIG_NLS_ISO8859_5 is not set # CONFIG_NLS_ISO8859_6 is not set # CONFIG_NLS_ISO8859_7 is not set # CONFIG_NLS_ISO8859_9 is not set # CONFIG_NLS_ISO8859_13 is not set # CONFIG_NLS_ISO8859_14 is not set # CONFIG_NLS_ISO8859_15 is not set # CONFIG_NLS_KOI8_R is not set # CONFIG_NLS_KOI8_U is not set # CONFIG_NLS_UTF8 is not set # # Distributed Lock Manager # # CONFIG_DLM is not set # # Instrumentation Support # # CONFIG_PROFILING is not set # # Kernel hacking # CONFIG_TRACE_IRQFLAGS_SUPPORT=y # CONFIG_PRINTK_TIME is not set # CONFIG_ENABLE_MUST_CHECK is not set CONFIG_MAGIC_SYSRQ=y # CONFIG_UNUSED_SYMBOLS is not set # CONFIG_DEBUG_FS is not set # CONFIG_HEADERS_CHECK is not set CONFIG_DEBUG_KERNEL=y CONFIG_LOG_BUF_SHIFT=14 # CONFIG_DETECT_SOFTLOCKUP is not set # CONFIG_SCHEDSTATS is not set # CONFIG_DEBUG_SLAB is not set # CONFIG_DEBUG_PREEMPT is not set # CONFIG_DEBUG_RT_MUTEXES is not set # CONFIG_RT_MUTEX_TESTER is not set # CONFIG_DEBUG_SPINLOCK is not set # CONFIG_DEBUG_MUTEXES is not set # CONFIG_DEBUG_RWSEMS is not set # CONFIG_DEBUG_LOCK_ALLOC is not set # CONFIG_PROVE_LOCKING is not set # CONFIG_DEBUG_SPINLOCK_SLEEP is not set # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set # CONFIG_DEBUG_KOBJECT is not set # CONFIG_DEBUG_HIGHMEM is not set CONFIG_DEBUG_BUGVERBOSE=y # CONFIG_DEBUG_INFO is not set # CONFIG_DEBUG_VM is not set # CONFIG_DEBUG_LIST is not set # CONFIG_FRAME_POINTER is not set # CONFIG_FORCED_INLINING is not set # CONFIG_RCU_TORTURE_TEST is not set CONFIG_EARLY_PRINTK=y # CONFIG_DEBUG_STACKOVERFLOW is not set # CONFIG_DEBUG_STACK_USAGE is not set # # Page alloc debug is incompatible with Software Suspend on i386 # # CONFIG_DEBUG_RODATA is not set 
CONFIG_4KSTACKS=y CONFIG_X86_FIND_SMP_CONFIG=y CONFIG_X86_MPPARSE=y CONFIG_DOUBLEFAULT=y # # Security options # # CONFIG_KEYS is not set # CONFIG_SECURITY is not set # # Cryptographic options # CONFIG_CRYPTO=y CONFIG_CRYPTO_ALGAPI=y CONFIG_CRYPTO_MANAGER=y # CONFIG_CRYPTO_HMAC is not set # CONFIG_CRYPTO_XCBC is not set # CONFIG_CRYPTO_NULL is not set # CONFIG_CRYPTO_MD4 is not set CONFIG_CRYPTO_MD5=y # CONFIG_CRYPTO_SHA1 is not set # CONFIG_CRYPTO_SHA256 is not set # CONFIG_CRYPTO_SHA512 is not set # CONFIG_CRYPTO_WP512 is not set # CONFIG_CRYPTO_TGR192 is not set # CONFIG_CRYPTO_GF128MUL is not set # CONFIG_CRYPTO_ECB is not set # CONFIG_CRYPTO_CBC is not set # CONFIG_CRYPTO_LRW is not set # CONFIG_CRYPTO_DES is not set # CONFIG_CRYPTO_BLOWFISH is not set # CONFIG_CRYPTO_TWOFISH is not set # CONFIG_CRYPTO_TWOFISH_586 is not set # CONFIG_CRYPTO_SERPENT is not set CONFIG_CRYPTO_AES=y CONFIG_CRYPTO_AES_586=y # CONFIG_CRYPTO_CAST5 is not set # CONFIG_CRYPTO_CAST6 is not set # CONFIG_CRYPTO_TEA is not set CONFIG_CRYPTO_ARC4=y # CONFIG_CRYPTO_KHAZAD is not set # CONFIG_CRYPTO_ANUBIS is not set CONFIG_CRYPTO_DEFLATE=y CONFIG_CRYPTO_MICHAEL_MIC=y # CONFIG_CRYPTO_CRC32C is not set # # Hardware crypto devices # # CONFIG_CRYPTO_DEV_PADLOCK is not set # CONFIG_CRYPTO_DEV_GEODE is not set # # Library routines # CONFIG_BITREVERSE=y CONFIG_CRC_CCITT=y CONFIG_CRC16=y CONFIG_CRC32=y CONFIG_LIBCRC32C=y CONFIG_ZLIB_INFLATE=y CONFIG_ZLIB_DEFLATE=y CONFIG_PLIST=y CONFIG_IOMAP_COPY=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_X86_SMP=y CONFIG_X86_HT=y CONFIG_X86_BIOS_REBOOT=y CONFIG_X86_TRAMPOLINE=y CONFIG_KTIME_SCALAR=y ^ permalink raw reply related [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3
  2006-12-18 19:44 ` Andrei Popa
@ 2006-12-18 20:14 ` Linus Torvalds
  2006-12-18 20:41 ` Linus Torvalds
  ` (3 more replies)
  0 siblings, 4 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-18 20:14 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 18 Dec 2006, Andrei Popa wrote:
>
> I dropped that patch and added WARN_ON(1), the unified patch is
> attached.
>
> I got corruption: "Hash check on download completion found bad chunks,
> consider using "safe_sync"."

Ok. That is actually _very_ interesting.

It's interesting because (a) the corruption obviously goes away with the
one-liner that effectively disables "page_mkclean_one()". So that tells us
that yes, it's a PTE dirty bit that matters.

But at the same time, it's interesting that it still happens when we try
to re-add the dirty bit. That would tell me that it's one of two cases:

 - there is another caller of page cleaning that should have done the
   same thing (we could check that by just doing this all _inside_ the
   page_mkclean() thing)

OR:

 - page_mkclean_one() is simply buggy.

And I'm starting to wonder about the second case. But it all LOOKS really
fine - I can't see anything wrong there (it uses the extremely
conservative "ptep_get_and_clear()", and seems to flush everything right
too, through "ptep_establish()").

		Linus

^ permalink raw reply [flat|nested] 311+ messages in thread
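[Editorial illustration, not part of the original message.] The "get and clear, then establish" sequence mentioned above can be pictured with a small userspace model. This is only a sketch under stated assumptions: the toy_ names, the bit layout, and the use of C11 atomics are invented for illustration and are not the kernel's page_mkclean_one(), ptep_get_and_clear() or ptep_establish().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout for a toy page-table entry. */
#define TOY_PTE_DIRTY   0x1u
#define TOY_PTE_WRITE   0x2u
#define TOY_PTE_PRESENT 0x4u

/*
 * Atomically fetch the old entry and leave the slot empty, so a dirty
 * bit set concurrently cannot be lost between the read and the write.
 */
static uint32_t toy_ptep_get_and_clear(_Atomic uint32_t *ptep)
{
	assert(ptep != NULL);
	return atomic_exchange(ptep, 0);
}

/*
 * Toy counterpart of the cleaning step: strip the dirty and write bits
 * and put the entry back. Returns 1 if the entry had been dirty.
 */
static int toy_page_mkclean_one(_Atomic uint32_t *ptep)
{
	uint32_t old = toy_ptep_get_and_clear(ptep);
	uint32_t cleaned = old & ~(TOY_PTE_DIRTY | TOY_PTE_WRITE);

	atomic_store(ptep, cleaned);	/* stands in for "establish" */
	return (old & TOY_PTE_DIRTY) != 0;
}
```

In this model, calling toy_page_mkclean_one() on an entry that is present, writable and dirty leaves only the present bit set, and a second call reports that nothing was dirty. A real implementation also has to flush TLBs, which the model ignores.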
* Re: 2.6.19 file content corruption on ext3
  2006-12-18 20:14 ` Linus Torvalds
@ 2006-12-18 20:41 ` Linus Torvalds
  2006-12-18 21:11 ` Andrei Popa
  2006-12-18 22:34 ` Gene Heskett
  2006-12-18 21:43 ` Andrew Morton
  ` (2 subsequent siblings)
  3 siblings, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-18 20:41 UTC (permalink / raw)
  To: Andrei Popa
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 18 Dec 2006, Linus Torvalds wrote:
>
> But at the same time, it's interesting that it still happens when we try
> to re-add the dirty bit. That would tell me that it's one of two cases:

Forget that. There's a third case, which is much more likely:

 - Andrew's patch had a ", 1" where it _should_ have had a ", 0".

This should be fairly easy to test: just change every single ", 1" case in
the patch to ", 0".

The only case that _definitely_ would want ",1" is actually the case that
already calls page_mkclean() directly: clear_page_dirty_for_io(). So no
other ", 1" is valid, and that one that needed it already avoided even
calling the "test_clear_page_dirty()" function, because it did it all by
hand.

What happens for you in that case?

		Linus

^ permalink raw reply [flat|nested] 311+ messages in thread
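[Editorial illustration, not part of the original message.] The ", 1" versus ", 0" hypothesis above can be made concrete with a toy model. Everything here is an assumption for illustration: the toy_ types and helpers are hypothetical and only mimic the idea of a must_clean_ptes flag. They are not kernel code, and the sketch shows the hypothesis being tested, not a confirmed root cause.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a PTE and a page-cache page. */
struct toy_pte  { bool dirty; bool writable; };
struct toy_page { bool dirty; struct toy_pte pte; };

/* Models a test_clear_page_dirty(page, must_clean_ptes) style helper. */
static int toy_test_clear_page_dirty(struct toy_page *page, int must_clean_ptes)
{
	assert(page != NULL);
	int was_dirty = page->dirty;

	page->dirty = false;
	if (was_dirty && must_clean_ptes) {
		/* Also drop the dirty state recorded in the mapping. */
		page->pte.dirty = false;
		page->pte.writable = false;
	}
	return was_dirty;
}

/* Models a later unmap or msync moving the PTE dirty bit to the page. */
static void toy_sync_pte_to_page(struct toy_page *page)
{
	if (page->pte.dirty) {
		page->pte.dirty = false;
		page->dirty = true;
	}
}

/*
 * A page written through a mapping has its dirty flag cleared by a
 * caller that does NOT start writeback. Returns whether the data is
 * still marked dirty anywhere afterwards.
 */
static bool toy_write_survives(int must_clean_ptes)
{
	struct toy_page page = {
		.dirty = true,
		.pte = { .dirty = true, .writable = true },
	};

	toy_test_clear_page_dirty(&page, must_clean_ptes);
	toy_sync_pte_to_page(&page);
	return page.dirty;
}
```

In this model, toy_write_survives(0) is true because the PTE still carries the dirty bit, while toy_write_survives(1) is false because both places that recorded the write were cleaned without any writeback having been issued.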
* Re: 2.6.19 file content corruption on ext3
  2006-12-18 20:41 ` Linus Torvalds
@ 2006-12-18 21:11 ` Andrei Popa
  2006-12-18 22:00 ` Alessandro Suardi
  2006-12-18 22:32 ` Linus Torvalds
  2006-12-18 22:34 ` Gene Heskett
  1 sibling, 2 replies; 311+ messages in thread
From: Andrei Popa @ 2006-12-18 21:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 12:41 -0800, Linus Torvalds wrote:
>
> On Mon, 18 Dec 2006, Linus Torvalds wrote:
> >
> > But at the same time, it's interesting that it still happens when we try
> > to re-add the dirty bit. That would tell me that it's one of two cases:
>
> Forget that. There's a third case, which is much more likely:
>
>  - Andrew's patch had a ", 1" where it _should_ have had a ", 0".
>
> This should be fairly easy to test: just change every single ", 1" case in
> the patch to ", 0".
>
> The only case that _definitely_ would want ",1" is actually the case that
> already calls page_mkclean() directly: clear_page_dirty_for_io(). So no
> other ", 1" is valid, and that one that needed it already avoided even
> calling the "test_clear_page_dirty()" function, because it did it all by
> hand.
>
> What happens for you in that case?
>
> 		Linus

I have file corruption.

diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageDirty(page) || PageWriteback(page))
 		return 0;
 
 	if (mapping == NULL) {		/* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
 	spin_lock(&mapping->private_lock);
 	ret = drop_buffers(page, &buffers_to_free);
 	spin_unlock(&mapping->private_lock);
-	if (ret) {
-		/*
-		 * If the filesystem writes its buffers by hand (eg ext3)
-		 * then we can have clean buffers against a dirty page.  We
-		 * clean the page here; otherwise later reattachment of buffers
-		 * could encounter a non-uptodate page, which is unresolvable.
-		 * This only applies in the rare case where try_to_free_buffers
-		 * succeeds but the page is not freed.
-		 *
-		 * Also, during truncate, discard_buffer will have marked all
-		 * the page's buffers clean.  We discover that here and clean
-		 * the page also.
-		 */
-		if (test_clear_page_dirty(page))
-			task_io_account_cancelled_write(PAGE_CACHE_SIZE);
-	}
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..760442f 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
 				wait_on_page_writeback(page);
 
 			if (PageWriteback(page) ||
-					!test_clear_page_dirty(page)) {
+					!test_clear_page_dirty(page, 0)) {
 				unlock_page(page);
 				break;
 			}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
 		spin_unlock(&fc->lock);
 
 		if (offset == 0 && to == PAGE_CACHE_SIZE) {
-			clear_page_dirty(page);
+			clear_page_dirty(page, 0);
 			SetPageUptodate(page);
 		}
 	}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..7b87875 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
 
 static void truncate_huge_page(struct page *page)
 {
-	clear_page_dirty(page);
+	clear_page_dirty(page, 0);
 	ClearPageUptodate(page);
 	remove_from_page_cache(page);
 	put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..47a6b62 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
 
 	/* Retest mp->count since we may have released page lock */
 	if (test_bit(META_discard, &mp->flag) && !mp->count) {
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 		ClearPageUptodate(page);
 	}
 #else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
 				bh = next;
 			} while (bh != head);
 			if (PAGE_SIZE == bh->b_size) {
-				clear_page_dirty(page);
+				clear_page_dirty(page, 0);
 			}
 		}
 	}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..d65ba84 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
 	ASSERT(!PageWriteback(page));
 	set_page_writeback(page);
 	if (clear_dirty)
-		clear_page_dirty(page);
+		clear_page_dirty(page, 0);
 	unlock_page(page);
 	if (!buffers) {
 		end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page) clear_bi
 
 struct page;	/* forward declaration */
 
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
 {
-	test_clear_page_dirty(page);
+	test_clear_page_dirty(page, must_clean_ptes);
 }
 
 static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..f7e0cc8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
*/ -int test_clear_page_dirty(struct page *page) +int test_clear_page_dirty(struct page *page, int must_clean_ptes) { struct address_space *mapping = page_mapping(page); unsigned long flags; @@ -866,7 +866,12 @@ int test_clear_page_dirty(struct page *p * page is locked, which pins the address_space */ if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); + int cleaned = page_mkclean(page); + if (!must_clean_ptes && cleaned){ + WARN_ON(1); + set_page_dirty(page); + } + dec_zone_page_state(page, NR_FILE_DIRTY); } return 1; diff --git a/mm/rmap.c b/mm/rmap.c diff --git a/mm/truncate.c b/mm/truncate.c index 9bfb8e8..cafa843 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp if (PagePrivate(page)) do_invalidatepage(page, 0); - if (test_clear_page_dirty(page)) + if (test_clear_page_dirty(page, 0)) task_io_account_cancelled_write(PAGE_CACHE_SIZE); ClearPageUptodate(page); ClearPageMappedToDisk(page); @@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct PAGE_CACHE_SIZE, 0); } } - was_dirty = test_clear_page_dirty(page); + was_dirty = test_clear_page_dirty(page, 0); if (!invalidate_complete_page2(mapping, page)) { if (was_dirty) set_page_dirty(page); ^ permalink raw reply related [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 21:11 ` Andrei Popa @ 2006-12-18 22:00 ` Alessandro Suardi 2006-12-18 22:45 ` Linus Torvalds 2006-12-18 22:32 ` Linus Torvalds 1 sibling, 1 reply; 311+ messages in thread From: Alessandro Suardi @ 2006-12-18 22:00 UTC (permalink / raw) To: andrei.popa Cc: Linus Torvalds, Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On 12/18/06, Andrei Popa <andrei.popa@i-neo.ro> wrote: > On Mon, 2006-12-18 at 12:41 -0800, Linus Torvalds wrote: > > > > On Mon, 18 Dec 2006, Linus Torvalds wrote: > > > > > > But at the same time, it's interesting that it still happens when we try > > > to re-add the dirty bit. That would tell me that it's one of two cases: > > > > Forget that. There's a third case, which is much more likely: > > > > - Andrew's patch had a ", 1" where it _should_ have had a ", 0". > > > > This should be fairly easy to test: just change every single ", 1" case in > > the patch to ", 0". > > > > The only case that _definitely_ would want ",1" is actually the case that > > already calls page_mkclean() directly: clear_page_dirty_for_io(). So no > > other ", 1" is valid, and that one that needed it already avoided even > > calling the "test_clear_page_dirty()" function, because it did it all by > > hand. > > > > What happens for you in that case? > > > > Linus > > I have file corruption. No idea whether this can be a data point or not, but here it goes... my P2P box is about to turn 5 days old while running nonstop one or both of aMule 2.1.3 and BitTorrent 4.4.0 on ext3 mounted w/default options on both IDE and USB disks. Zero corruption. AMD K7-800, 512MB RAM, PREEMPT/UP kernel, 2.6.19-git20 on top of up-to-date FC6. --alessandro "...when I get it, I _get_ it" (Lara Eidemiller) ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 22:00 ` Alessandro Suardi @ 2006-12-18 22:45 ` Linus Torvalds 2006-12-19 0:13 ` Andrei Popa 0 siblings, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-18 22:45 UTC (permalink / raw) To: Alessandro Suardi Cc: andrei.popa, Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 18 Dec 2006, Alessandro Suardi wrote: > > No idea whether this can be a data point or not, but > here it goes... my P2P box is about to turn 5 days old > while running nonstop one or both of aMule 2.1.3 and > BitTorrent 4.4.0 on ext3 mounted w/default options > on both IDE and USB disks. Zero corruption. > > AMD K7-800, 512MB RAM, PREEMPT/UP kernel, > 2.6.19-git20 on top of up-to-date FC6. It _looks_ like PREEMPT/SMP is one common configuration. It might also be that the blocksize of the filesystem matters. 4kB filesystems are fundamentally simpler than 1kB filesystems, for example. You can tell at least with "/sbin/dumpe2fs -h /dev/..." or something. Andrei - one thing that might be interesting to see: when corruption occurs, can you get the corrupted file somehow? And compare it with a known-good copy to see what the corruption looks like? Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
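The comparison against a known-good copy that Linus asks for can be sketched in a few lines of userspace C. This is only an illustration, not a tool used in the thread: the names `diff_range` and `first_diff_range` are hypothetical, and the sketch assumes both copies have already been read into memory buffers of equal length.

```c
#include <assert.h>
#include <stddef.h>

/* A half-open byte range [start, end) in which two buffers differ. */
struct diff_range {
    size_t start;
    size_t end;
};

/*
 * Find the first contiguous run of differing bytes between a known-good
 * buffer and a suspect buffer of the same length.
 * Returns 1 and fills *out if a difference exists, 0 if the buffers match.
 * (Hypothetical helper, written for illustration only.)
 */
static int first_diff_range(const unsigned char *good,
                            const unsigned char *bad,
                            size_t len, struct diff_range *out)
{
    size_t i = 0;

    assert(good != NULL && bad != NULL && out != NULL);

    /* Skip the identical prefix. */
    while (i < len && good[i] == bad[i])
        i++;
    if (i == len)
        return 0;

    /* Extend the range for as long as the bytes keep differing. */
    out->start = i;
    while (i < len && good[i] != bad[i])
        i++;
    out->end = i;
    return 1;
}
```

Note that a damaged region can be reported as shorter than it really is if the good data happens to contain the same byte value as the corruption at some offset, so the offsets are a starting point for inspection rather than a verdict.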
* Re: 2.6.19 file content corruption on ext3 2006-12-18 22:45 ` Linus Torvalds @ 2006-12-19 0:13 ` Andrei Popa 2006-12-19 0:29 ` Linus Torvalds 0 siblings, 1 reply; 311+ messages in thread From: Andrei Popa @ 2006-12-19 0:13 UTC (permalink / raw) To: Linus Torvalds Cc: Alessandro Suardi, Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 2006-12-18 at 14:45 -0800, Linus Torvalds wrote: > > On Mon, 18 Dec 2006, Alessandro Suardi wrote: > > > > No idea whether this can be a data point or not, but > > here it goes... my P2P box is about to turn 5 days old > > while running nonstop one or both of aMule 2.1.3 and > > BitTorrent 4.4.0 on ext3 mounted w/default options > > on both IDE and USB disks. Zero corruption. > > > > AMD K7-800, 512MB RAM, PREEMPT/UP kernel, > > 2.6.19-git20 on top of up-to-date FC6. > > It _looks_ like PREEMPT/SMP is one common configuration. > > It might also be that the blocksize of the filesystem matters. 4kB > filesystems are fundamentally simpler than 1kB filesystems, for example. > You can tell at least with "/sbin/dumpe2fs -h /dev/..." or something. > > Andrei - one thing that might be interesting to see: when corruption > occurs, can you get the corrupted file somehow? And compare it with a > known-good copy to see what the corruption looks like? The corrupted file has a chunk full of zeros: http://193.226.119.62/corruption0.jpg http://193.226.119.62/corruption1.jpg ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 0:13 ` Andrei Popa @ 2006-12-19 0:29 ` Linus Torvalds 0 siblings, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-19 0:29 UTC (permalink / raw) To: Andrei Popa Cc: Alessandro Suardi, Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 19 Dec 2006, Andrei Popa wrote: > > The corrupted file has a chunk full of zeros: > > http://193.226.119.62/corruption0.jpg > http://193.226.119.62/corruption1.jpg Thanks. Yup, filled with zeroes, and the corruption stops (but does _not_ start) at a page boundary. That _does_ look very much like it was filled in linearly, then written out to disk when it was in the middle of the page, and then we simply lost the further writes that should also have gone on to that page. All consistent with dropping a dirty bit somewhere in the middle of the page updates. Which we kind of knew must be the issue anyway, but it's good to know that the corruption pattern is consistent with what we're trying to figure out. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
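The pattern Linus describes (a zero-filled run that starts mid-page and stops exactly on a page boundary) can be checked mechanically. The following is a minimal sketch, assuming the file contents are already in a buffer and that pages are 4 kB; `find_zero_run`, `is_page_aligned` and `SKETCH_PAGE_SIZE` are hypothetical names introduced here, not anything from the kernel or from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 kB pages. Adjust if the machine uses another page size. */
#define SKETCH_PAGE_SIZE 4096u

/* A half-open byte range [start, end) that contains only zero bytes. */
struct zero_run {
    size_t start;
    size_t end;
};

/*
 * Find the first run of at least min_len consecutive zero bytes.
 * Returns 1 and fills *out when such a run exists, 0 otherwise.
 */
static int find_zero_run(const unsigned char *buf, size_t len,
                         size_t min_len, struct zero_run *out)
{
    size_t i = 0;

    assert(buf != NULL && out != NULL && min_len > 0);

    while (i < len) {
        size_t j;

        if (buf[i] != 0) {
            i++;
            continue;
        }
        /* buf[i] is zero: measure how far the run extends. */
        j = i;
        while (j < len && buf[j] == 0)
            j++;
        if (j - i >= min_len) {
            out->start = i;
            out->end = j;
            return 1;
        }
        i = j;
    }
    return 0;
}

/* True if a file offset falls exactly on a page boundary. */
static int is_page_aligned(size_t offset)
{
    return offset % SKETCH_PAGE_SIZE == 0;
}
```

A run whose `end` is page aligned while its `start` is not would match the "stops but does not start at a page boundary" signature. A `min_len` of a few dozen bytes helps to skip zero bytes that legitimately occur in the data.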
* Re: 2.6.19 file content corruption on ext3 2006-12-18 21:11 ` Andrei Popa 2006-12-18 22:00 ` Alessandro Suardi @ 2006-12-18 22:32 ` Linus Torvalds 2006-12-18 23:48 ` Andrei Popa 1 sibling, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-18 22:32 UTC (permalink / raw) To: Andrei Popa Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 18 Dec 2006, Andrei Popa wrote: > > > > This should be fairly easy to test: just change every single ", 1" case in > > the patch to ", 0". > > > > What happens for you in that case? > > I have file corruption. Magic. And btw, _thanks_ for being such a great tester. So now I have one more thing for you to try, if you can bother: There's exactly two call sites that call "page_mkclean()" (and that is the only thing in turn that calls "page_mkclean_one()", which we already determined will cause the corruption). Both of them do if (mapping_cap_account_dirty(mapping)) { .. things, although they do slightly different things inside that if in your patched kernel. Can you just TOTALLY DISABLE that case for the test_clear_page_dirty() case? Just do an "#if 0 .. #endif" around that whole if-statement, leaving the _only_ thing that actually calls "page_mkclean()" to be the "clear_page_dirty_for_io()" call. Do you still see corruption? Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 22:32 ` Linus Torvalds @ 2006-12-18 23:48 ` Andrei Popa 2006-12-19 0:04 ` Linus Torvalds 2006-12-19 1:03 ` Gene Heskett 0 siblings, 2 replies; 311+ messages in thread From: Andrei Popa @ 2006-12-18 23:48 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 2006-12-18 at 14:32 -0800, Linus Torvalds wrote: > > On Mon, 18 Dec 2006, Andrei Popa wrote: > > > > > > This should be fairly easy to test: just change every single ", 1" case in > > > the patch to ", 0". > > > > > > What happens for you in that case? > > > > I have file corruption. > > Magic. And btw, _thanks_ for being such a great tester. > > So now I have one more thing for you to try, if you can bother: > > There's exactly two call sites that call "page_mkclean()" (and that is the > only thing in turn that calls "page_mkclean_one()", which we already > determined will cause the corruption). > > Both of them do > > if (mapping_cap_account_dirty(mapping)) { > .. > > things, although they do slightly different things inside that if in your > patched kernel. > > Can you just TOTALLY DISABLE that case for the test_clear_page_dirty() > case? Just do an "#if 0 .. #endif" around that whole if-statement, leaving > the _only_ thing that actually calls "page_mkclean()" to be the > "clear_page_dirty_for_io()" call. > > Do you still see corruption? Nope, no file corruption at all. diff --git a/fs/buffer.c b/fs/buffer.c index d1f1b54..263f88e 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? 
*/ @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* - * If the filesystem writes its buffers by hand (eg ext3) - * then we can have clean buffers against a dirty page. We - * clean the page here; otherwise later reattachment of buffers - * could encounter a non-uptodate page, which is unresolvable. - * This only applies in the rare case where try_to_free_buffers - * succeeds but the page is not freed. - * - * Also, during truncate, discard_buffer will have marked all - * the page's buffers clean. We discover that here and clean - * the page also. - */ - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff --git a/fs/cifs/file.c b/fs/cifs/file.c index 0f05cab..2d8bbbb 100644 --- a/fs/cifs/file.c +++ b/fs/cifs/file.c @@ -1245,7 +1245,7 @@ retry: wait_on_page_writeback(page); if (PageWriteback(page) || - !test_clear_page_dirty(page)) { + !test_clear_page_dirty(page, 0)) { unlock_page(page); break; } diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 1387749..da2bdb1 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -484,7 +484,7 @@ static int fuse_commit_write(struct file spin_unlock(&fc->lock); if (offset == 0 && to == PAGE_CACHE_SIZE) { - clear_page_dirty(page); + clear_page_dirty(page, 0); SetPageUptodate(page); } } diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index ed2c223..9f82cd0 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + clear_page_dirty(page, 0); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c index b1a1c72..5e29b37 100644 --- a/fs/jfs/jfs_metapage.c +++ 
b/fs/jfs/jfs_metapage.c @@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1 /* Retest mp->count since we may have released page lock */ if (test_bit(META_discard, &mp->flag) && !mp->count) { - clear_page_dirty(page); + clear_page_dirty(page, 0); ClearPageUptodate(page); } #else diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c index 47e7027..a97e198 100644 --- a/fs/reiserfs/stree.c +++ b/fs/reiserfs/stree.c @@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p bh = next; } while (bh != head); if (PAGE_SIZE == bh->b_size) { - clear_page_dirty(page); + clear_page_dirty(page, 0); } } } diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c index b56eb75..44ac434 100644 --- a/fs/xfs/linux-2.6/xfs_aops.c +++ b/fs/xfs/linux-2.6/xfs_aops.c @@ -343,7 +343,7 @@ xfs_start_page_writeback( ASSERT(!PageWriteback(page)); set_page_writeback(page); if (clear_dirty) - clear_page_dirty(page); + clear_page_dirty(page, 0); unlock_page(page); if (!buffers) { end_page_writeback(page); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 4830a3b..175ab3c 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -253,13 +253,13 @@ #define ClearPageUncached(page) clear_bi struct page; /* forward declaration */ -int test_clear_page_dirty(struct page *page); +int test_clear_page_dirty(struct page *page, int must_clean_ptes); int test_clear_page_writeback(struct page *page); int test_set_page_writeback(struct page *page); -static inline void clear_page_dirty(struct page *page) +static inline void clear_page_dirty(struct page *page, int must_clean_ptes) { - test_clear_page_dirty(page); + test_clear_page_dirty(page, must_clean_ptes); } static inline void set_page_writeback(struct page *page) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 237107c..f2a157d 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock); * Clear a page's dirty flag, while caring for dirty 
memory accounting. * Returns true if the page was previously dirty. */ -int test_clear_page_dirty(struct page *page) +int test_clear_page_dirty(struct page *page, int must_clean_ptes) { struct address_space *mapping = page_mapping(page); unsigned long flags; @@ -857,6 +857,8 @@ int test_clear_page_dirty(struct page *p return TestClearPageDirty(page); write_lock_irqsave(&mapping->tree_lock, flags); + +#if 0 if (TestClearPageDirty(page)) { radix_tree_tag_clear(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); @@ -866,11 +868,19 @@ int test_clear_page_dirty(struct page *p * page is locked, which pins the address_space */ if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); + int cleaned = page_mkclean(page); + if (!must_clean_ptes && cleaned){ + WARN_ON(1); + set_page_dirty(page); + } + dec_zone_page_state(page, NR_FILE_DIRTY); } return 1; } + +#endif + write_unlock_irqrestore(&mapping->tree_lock, flags); return 0; } diff --git a/mm/rmap.c b/mm/rmap.c diff --git a/mm/truncate.c b/mm/truncate.c index 9bfb8e8..9a01d9e 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp if (PagePrivate(page)) do_invalidatepage(page, 0); - if (test_clear_page_dirty(page)) + if (test_clear_page_dirty(page, 0)) task_io_account_cancelled_write(PAGE_CACHE_SIZE); ClearPageUptodate(page); ClearPageMappedToDisk(page); @@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct PAGE_CACHE_SIZE, 0); } } - was_dirty = test_clear_page_dirty(page); + was_dirty = test_clear_page_dirty(page, 0); if (!invalidate_complete_page2(mapping, page)) { if (was_dirty) set_page_dirty(page); ^ permalink raw reply related [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 23:48 ` Andrei Popa @ 2006-12-19 0:04 ` Linus Torvalds 2006-12-19 0:29 ` Andrei Popa 2006-12-19 1:03 ` Gene Heskett 1 sibling, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-19 0:04 UTC (permalink / raw) To: Andrei Popa Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 19 Dec 2006, Andrei Popa wrote: > > > > There's exactly two call sites that call "page_mkclean()" (and that is the > > only thing in turn that calls "page_mkclean_one()", which we already > > determined will cause the corruption). > > > > Can you just TOTALLY DISABLE that case for the test_clear_page_dirty() > > case? Just do an "#if 0 .. #endif" around that whole if-statement, leaving > > the _only_ thing that actually calls "page_mkclean()" to be the > > "clear_page_dirty_for_io()" call. > > > > Do you still see corruption? > > nope, no file corruption at all. Ok. That's interesting, but I think you actually #ifdef'ed out too much: > + > +#if 0 > if (TestClearPageDirty(page)) { > radix_tree_tag_clear(&mapping->page_tree, > page_index(page), PAGECACHE_TAG_DIRTY); > @@ -866,11 +868,19 @@ int test_clear_page_dirty(struct page *p > * page is locked, which pins the address_space > */ > if (mapping_cap_account_dirty(mapping)) { > - page_mkclean(page); > + int cleaned = page_mkclean(page); > + if (!must_clean_ptes && cleaned){ > + WARN_ON(1); > + set_page_dirty(page); > + } > + > dec_zone_page_state(page, NR_FILE_DIRTY); > } > return 1; > } > + > +#endif > + It was really just the _inner_ "if (mapping_cap_account_dirty(.." statement that I meant you should remove. Can you try that too? Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 0:04 ` Linus Torvalds @ 2006-12-19 0:29 ` Andrei Popa 2006-12-19 0:57 ` Linus Torvalds 0 siblings, 1 reply; 311+ messages in thread From: Andrei Popa @ 2006-12-19 0:29 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 2006-12-18 at 16:04 -0800, Linus Torvalds wrote: > > On Tue, 19 Dec 2006, Andrei Popa wrote: > > > > > > There's exactly two call sites that call "page_mkclean()" (and that is the > > > only thing in turn that calls "page_mkclean_one()", which we already > > > determined will cause the corruption). > > > > > > Can you just TOTALLY DISABLE that case for the test_clear_page_dirty() > > > case? Just do an "#if 0 .. #endif" around that whole if-statement, leaving > > > the _only_ thing that actually calls "page_mkclean()" to be the > > > "clear_page_dirty_for_io()" call. > > > > > > Do you still see corruption? > > > > nope, no file corruption at all. > > Ok. That's interesting, but I think you actually #ifdef'ed out too > much: > > > + > > +#if 0 > > if (TestClearPageDirty(page)) { > > radix_tree_tag_clear(&mapping->page_tree, > > page_index(page), PAGECACHE_TAG_DIRTY); > > @@ -866,11 +868,19 @@ int test_clear_page_dirty(struct page *p > > * page is locked, which pins the address_space > > */ > > if (mapping_cap_account_dirty(mapping)) { > > - page_mkclean(page); > > + int cleaned = page_mkclean(page); > > + if (!must_clean_ptes && cleaned){ > > + WARN_ON(1); > > + set_page_dirty(page); > > + } > > + > > dec_zone_page_state(page, NR_FILE_DIRTY); > > } > > return 1; > > } > > + > > +#endif > > + > > It was really just the _inner_ "if (mapping_cap_account_dirty(.." > statement that I meant you should remove. > > Can you try that too? I have file corruption: "Hash check on download completion found bad chunks, consider using "safe_sync"." 
diff --git a/fs/buffer.c b/fs/buffer.c index d1f1b54..263f88e 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? */ @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* - * If the filesystem writes its buffers by hand (eg ext3) - * then we can have clean buffers against a dirty page. We - * clean the page here; otherwise later reattachment of buffers - * could encounter a non-uptodate page, which is unresolvable. - * This only applies in the rare case where try_to_free_buffers - * succeeds but the page is not freed. - * - * Also, during truncate, discard_buffer will have marked all - * the page's buffers clean. We discover that here and clean - * the page also. 
- */ - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff --git a/fs/cifs/file.c b/fs/cifs/file.c index 0f05cab..2d8bbbb 100644 --- a/fs/cifs/file.c +++ b/fs/cifs/file.c @@ -1245,7 +1245,7 @@ retry: wait_on_page_writeback(page); if (PageWriteback(page) || - !test_clear_page_dirty(page)) { + !test_clear_page_dirty(page, 0)) { unlock_page(page); break; } diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 1387749..da2bdb1 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -484,7 +484,7 @@ static int fuse_commit_write(struct file spin_unlock(&fc->lock); if (offset == 0 && to == PAGE_CACHE_SIZE) { - clear_page_dirty(page); + clear_page_dirty(page, 0); SetPageUptodate(page); } } diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index ed2c223..9f82cd0 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + clear_page_dirty(page, 0); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c index b1a1c72..5e29b37 100644 --- a/fs/jfs/jfs_metapage.c +++ b/fs/jfs/jfs_metapage.c @@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1 /* Retest mp->count since we may have released page lock */ if (test_bit(META_discard, &mp->flag) && !mp->count) { - clear_page_dirty(page); + clear_page_dirty(page, 0); ClearPageUptodate(page); } #else diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c index 47e7027..a97e198 100644 --- a/fs/reiserfs/stree.c +++ b/fs/reiserfs/stree.c @@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p bh = next; } while (bh != head); if (PAGE_SIZE == bh->b_size) { - clear_page_dirty(page); + clear_page_dirty(page, 0); } } } diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c index b56eb75..44ac434 100644 --- 
a/fs/xfs/linux-2.6/xfs_aops.c +++ b/fs/xfs/linux-2.6/xfs_aops.c @@ -343,7 +343,7 @@ xfs_start_page_writeback( ASSERT(!PageWriteback(page)); set_page_writeback(page); if (clear_dirty) - clear_page_dirty(page); + clear_page_dirty(page, 0); unlock_page(page); if (!buffers) { end_page_writeback(page); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 4830a3b..175ab3c 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -253,13 +253,13 @@ #define ClearPageUncached(page) clear_bi struct page; /* forward declaration */ -int test_clear_page_dirty(struct page *page); +int test_clear_page_dirty(struct page *page, int must_clean_ptes); int test_clear_page_writeback(struct page *page); int test_set_page_writeback(struct page *page); -static inline void clear_page_dirty(struct page *page) +static inline void clear_page_dirty(struct page *page, int must_clean_ptes) { - test_clear_page_dirty(page); + test_clear_page_dirty(page, must_clean_ptes); } static inline void set_page_writeback(struct page *page) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 237107c..4ff7f90 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock); * Clear a page's dirty flag, while caring for dirty memory accounting. * Returns true if the page was previously dirty. 
*/ -int test_clear_page_dirty(struct page *page) +int test_clear_page_dirty(struct page *page, int must_clean_ptes) { struct address_space *mapping = page_mapping(page); unsigned long flags; @@ -857,6 +857,7 @@ int test_clear_page_dirty(struct page *p return TestClearPageDirty(page); write_lock_irqsave(&mapping->tree_lock, flags); + if (TestClearPageDirty(page)) { radix_tree_tag_clear(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); @@ -865,12 +866,23 @@ int test_clear_page_dirty(struct page *p * We can continue to use `mapping' here because the * page is locked, which pins the address_space */ + +#if 0 + if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); + int cleaned = page_mkclean(page); + if (!must_clean_ptes && cleaned){ + WARN_ON(1); + set_page_dirty(page); + } + dec_zone_page_state(page, NR_FILE_DIRTY); } +#endif + return 1; } + write_unlock_irqrestore(&mapping->tree_lock, flags); return 0; } diff --git a/mm/rmap.c b/mm/rmap.c diff --git a/mm/truncate.c b/mm/truncate.c index 9bfb8e8..9a01d9e 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp if (PagePrivate(page)) do_invalidatepage(page, 0); - if (test_clear_page_dirty(page)) + if (test_clear_page_dirty(page, 0)) task_io_account_cancelled_write(PAGE_CACHE_SIZE); ClearPageUptodate(page); ClearPageMappedToDisk(page); @@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct PAGE_CACHE_SIZE, 0); } } - was_dirty = test_clear_page_dirty(page); + was_dirty = test_clear_page_dirty(page, 0); if (!invalidate_complete_page2(mapping, page)) { if (was_dirty) set_page_dirty(page); ^ permalink raw reply related [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 0:29 ` Andrei Popa @ 2006-12-19 0:57 ` Linus Torvalds 2006-12-19 1:21 ` Andrew Morton 2006-12-19 1:50 ` Andrei Popa 0 siblings, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-19 0:57 UTC (permalink / raw) To: Andrei Popa Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 19 Dec 2006, Andrei Popa wrote: > > > > > > nope, no file corruption at all. > > > > Ok. That's interesting, but I think you actually #ifdef'ed out too > > much: > > > > It was really just the _inner_ "if (mapping_cap_account_dirty(.." > > statement that I meant you should remove. > > > > Can you try that too? > > I have file corruption: "Hash check on download completion found bad > chunks, consider using "safe_sync"." Ok, that's interesting. So it doesn't seem to be the call to page_mkclean() itself that causes corruption. It looks like Peter's hunch that maybe there's some bug in PG_dirty handling _itself_ might be an idea.. And the reason it only started happening now is that it may just have been _hidden_ by the fact that while we kept the dirty bits in the page tables, we'd end up writing the dirty page _despite_ having lost the PG_dirty bit. So if it's some bad interaction between writable mappings and some other part of the system, we just didn't see it earlier, exactly because we had _lots_ of dirty bits, and it was enough that _one_ of them was right. If you didn't see corruption when you #ifdef'ed out too much of the "test_clear_page_dirty()" function (the _whole_ TestClearPageDirty() if-statement), but you get it when you just comment out the stuff that does the page_mkclean(), that's interesting. I'm left looking at the "radix_tree_tag_clear()" in test_clear_page_dirty(). What happens if you only ifdef out that single thing? 
The actual page-cleaning functions make sure to only clear the TAG_DIRTY bit _after_ the page has been marked for writeback. Is there some ordering constraint there, perhaps? I'm really reaching here. I'm trying to see the pattern, and I'm not seeing it. I'm asking you to test things just to get more of a feel for what triggers the failure, than because I actually have any kind of idea of what the heck is going on. Andrew, Nick, Hugh - any ideas? Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
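Linus's argument that redundant dirty bits could have hidden an older bug can be illustrated with a deliberately tiny userspace model. This is not kernel code and makes no claim about the real data structures: `toy_page` and the `toy_*` functions are hypothetical, and the model assumes just one page-level flag and one page-table dirty bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: one page tracked by two independent dirty indicators. */
struct toy_page {
    bool pg_dirty;   /* stands in for the PG_dirty page flag */
    bool pte_dirty;  /* stands in for the dirty bit in a page-table entry */
};

/* A store through a writable mapping leaves both indicators set. */
static void toy_write(struct toy_page *p)
{
    assert(p != NULL);
    p->pg_dirty = true;
    p->pte_dirty = true;
}

/*
 * Some path clears the page-level flag. With clean_ptes set it also clears
 * the page-table bit, which is the extra step that page_mkclean() performs
 * in the patched kernels discussed in this thread.
 */
static void toy_clear_dirty(struct toy_page *p, bool clean_ptes)
{
    assert(p != NULL);
    p->pg_dirty = false;
    if (clean_ptes)
        p->pte_dirty = false;
}

/* In this model the data still reaches disk if any indicator says dirty. */
static bool toy_will_be_written(const struct toy_page *p)
{
    assert(p != NULL);
    return p->pg_dirty || p->pte_dirty;
}
```

In the model, a wrongly cleared `pg_dirty` is harmless as long as `pte_dirty` survives, because the page is still written back later. Once both are cleared together, the same mistake loses the data, which is the "it was enough that one of them was right" situation described above.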
* Re: 2.6.19 file content corruption on ext3 2006-12-19 0:57 ` Linus Torvalds @ 2006-12-19 1:21 ` Andrew Morton 2006-12-19 1:44 ` Andrei Popa 2006-12-19 1:50 ` Andrei Popa 1 sibling, 1 reply; 311+ messages in thread From: Andrew Morton @ 2006-12-19 1:21 UTC (permalink / raw) To: Linus Torvalds Cc: Andrei Popa, Peter Zijlstra, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 18 Dec 2006 16:57:30 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > What happens if you only ifdef out that single thing? > > The actual page-cleaning functions make sure to only clear the TAG_DIRTY > bit _after_ the page has been marked for writeback. Is there some ordering > constraint there, perhaps? > > I'm really reaching here. I'm trying to see the pattern, and I'm not > seeing it. I'm asking you to test things just to get more of a feel for > what triggers the failure, than because I actually have any kind of idea > of what the heck is going on. > > Andrew, Nick, Hugh - any ideas? If all of test_clear_page_dirty() has been commented out then the page will never become clean hence will never fall out of pagecache, so unless Andrei is doing a reboot before checking for corruption, perhaps the underlying data on-disk is incorrect, but we can't see it. Andrei, how _are_ you running this test? What's the exact sequence of steps? In particular, are you doing anything which would cause the corrupted file to be evicted from memory, thus forcing a read from disk? Such as unmounting and then remounting the filesystem? The point of my question is to check that the data is really incorrect on-disk, or whether it is incorrect in pagecache. Also, it'd be useful if you could determine whether the bug appears with the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with rootfstype=ext2 if it's the root filesystem. Thanks. ^ permalink raw reply [flat|nested] 311+ messages in thread
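The distinction between "wrong in pagecache" and "wrong on disk" can be probed from userspace with something like the sketch below. It is an illustration under assumptions, not the test anyone in this thread ran: the function name and the byte pattern are made up, and posix_fadvise(POSIX_FADV_DONTNEED) is only a hint, so a definitive check still needs an unmount/remount or a reboot as described above.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a known pattern through a shared writable
 * mapping, flush it, ask the kernel to drop the cached pages, then read
 * the file back with pread() and compare.
 * Returns 0 if the data matches, 1 on mismatch, -1 on error.
 */
int mmap_roundtrip_check(const char *path, size_t len)
{
    unsigned char *map;
    unsigned char buf[4096];
    size_t off = 0;
    int rc = -1;
    int fd;

    assert(len > 0);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0)
        goto out;

    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;
    for (size_t i = 0; i < len; i++)
        map[i] = (unsigned char)(i * 31 + 7);   /* arbitrary pattern */

    /* Push the dirty pages to the file before unmapping. */
    if (msync(map, len, MS_SYNC) != 0) {
        munmap(map, len);
        goto out;
    }
    if (munmap(map, len) != 0)
        goto out;

    /* Hint only: the kernel may still keep the pages cached. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    rc = 0;
    while (off < len) {
        size_t want = len - off < sizeof buf ? len - off : sizeof buf;
        ssize_t got = pread(fd, buf, want, (off_t)off);

        if (got <= 0) {
            rc = -1;
            break;
        }
        for (ssize_t j = 0; j < got; j++) {
            if (buf[j] != (unsigned char)((off + (size_t)j) * 31 + 7))
                rc = 1;
        }
        off += (size_t)got;
    }
out:
    close(fd);
    return rc;
}
```

A caller would do something like mmap_roundtrip_check("/mnt/test/pattern.bin", 64 * 1024 * 1024) on the filesystem under test. A single quiet run like this is not expected to trigger the corruption reported here, which needed a concurrent download and memory pressure; the sketch only shows the shape of a check that does not trust the pagecache copy.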
* Re: 2.6.19 file content corruption on ext3 2006-12-19 1:21 ` Andrew Morton @ 2006-12-19 1:44 ` Andrei Popa 2006-12-19 1:54 ` Andrew Morton 0 siblings, 1 reply; 311+ messages in thread From: Andrei Popa @ 2006-12-19 1:44 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 2006-12-18 at 17:21 -0800, Andrew Morton wrote: > On Mon, 18 Dec 2006 16:57:30 -0800 (PST) > Linus Torvalds <torvalds@osdl.org> wrote: > > > What happens if you only ifdef out that single thing? > > > > The actual page-cleaning functions make sure to only clear the TAG_DIRTY > > bit _after_ the page has been marked for writeback. Is there some ordering > > constraint there, perhaps? > > > > I'm really reaching here. I'm trying to see the pattern, and I'm not > > seeing it. I'm asking you to test things just to get more of a feel for > > what triggers the failure, than because I actually have any kind of idea > > of what the heck is going on. > > > > Andrew, Nick, Hugh - any ideas? > > If all of test_clear_page_dirty() has been commented out then the page will > never become clean hence will never fall out of pagecache, so unless Andrei > is doing a reboot before checking for corruption, perhaps the underlying > data on-disk is incorrect, but we can't see it. If I do a sync and echo 1 > /proc/sys/vm/drop_caches, is the reboot still necessary? > > Andrei, how _are_ you running this test? What's the exact sequence of steps? > > In particular, are you doing anything which would cause the corrupted file > to be evicted from memory, thus forcing a read from disk? Such as > unmounting and then remounting the filesystem? 
I boot Linux, start rtorrent and start the download. While it's downloading I start evolution and check my mail (my mbox is very large, several hundred megabytes). I close evolution (I use it just to have another application which uses the filesystem and the memory), then start evolution again. I start firefox. The download completes and rtorrent says whether the hash is good or not. I run "unrar t qwe.rar" to test that all 84 downloaded rar files are OK and look at the result. > > The point of my question is to check that the data is really incorrect > on-disk, or whether it is incorrect in pagecache. > > Also, it'd be useful if you could determine whether the bug appears with > the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with > rootfstype=ext2 if it's the root filesystem. I will test. > > Thanks. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 1:44 ` Andrei Popa @ 2006-12-19 1:54 ` Andrew Morton 2006-12-19 2:04 ` Andrei Popa 2006-12-19 8:05 ` Andrei Popa 0 siblings, 2 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-19 1:54 UTC (permalink / raw) To: andrei.popa Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 19 Dec 2006 03:44:51 +0200 Andrei Popa <andrei.popa@i-neo.ro> wrote: > On Mon, 2006-12-18 at 17:21 -0800, Andrew Morton wrote: > > On Mon, 18 Dec 2006 16:57:30 -0800 (PST) > > Linus Torvalds <torvalds@osdl.org> wrote: > > > > > What happens if you only ifdef out that single thing? > > > > > > The actual page-cleaning functions make sure to only clear the TAG_DIRTY > > > bit _after_ the page has been marked for writeback. Is there some ordering > > > constraint there, perhaps? > > > > > > I'm really reaching here. I'm trying to see the pattern, and I'm not > > > seeing it. I'm asking you to test things just to get more of a feel for > > > what triggers the failure, than because I actually have any kind of idea > > > of what the heck is going on. > > > > > > Andrew, Nick, Hugh - any ideas? > > > > If all of test_clear_page_dirty() has been commented out then the page will > > never become clean hence will never fall out of pagecache, so unless Andrei > > is doing a reboot before checking for corruption, perhaps the underlying > > data on-disk is incorrect, but we can't see it. > > if I do a sync and echo 1 > /proc/sys/vm/drop_caches OK, that works. > does the reboot is > still necesary ? It might be necessary to reboot in this case - if we're leaving the pagecache dirty, writing to drop_caches won't remove it. And you probably won't be able to get a clean reboot either. > > > > Andrei, how _are_ you running this test? What's the exact sequence of steps? 
> > > > In particular, are you doing anything which would cause the corrupted file > > to be evicted from memory, thus forcing a read from disk? Such as > > unmounting and then remounting the filesystem? > > I boot linux, I start rtorrent and start the download, while it's > downloading I start evolution and i check my mail(my mbox is very large, > several hundered megabytes), I close evolution(I use evolution just to > have another application witch uses the filesystem and the memory), I > start evolution again. I start firefox. The download is complete. > Rtorrent says if the hash is good or not. I do a "unrar t qwe.rar" to > test that all 84 downloaded rar files are ok and see the result. > > > > > The point of my question is to check that the data is really incorrect > > on-disk, or whether it is incorrect in pagecache. > > > > Also, it'd be useful if you could determine whether the bug appears with > > the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with > > rootfstype=ext2 if it's the root filesystem. > > I will test. ok, thanks. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 1:54 ` Andrew Morton @ 2006-12-19 2:04 ` Andrei Popa 2006-12-19 8:05 ` Andrei Popa 1 sibling, 0 replies; 311+ messages in thread From: Andrei Popa @ 2006-12-19 2:04 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr > > > If all of test_clear_page_dirty() has been commented out then the page will > > > never become clean hence will never fall out of pagecache, so unless Andrei > > > is doing a reboot before checking for corruption, perhaps the underlying > > > data on-disk is incorrect, but we can't see it. > > > > if I do a sync and echo 1 > /proc/sys/vm/drop_caches > > OK, that works. > > > does the reboot is > > still necesary ? > > It might be necessary to reboot in this case - if we're leaving the > pagecache dirty, writing to drop_caches won't remove it. And you probably > won't be able to get a clean reboot either. > > > > > > > Andrei, how _are_ you running this test? What's the exact sequence of steps? > > > > > > In particular, are you doing anything which would cause the corrupted file > > > to be evicted from memory, thus forcing a read from disk? Such as > > > unmounting and then remounting the filesystem? > > > > I boot linux, I start rtorrent and start the download, while it's > > downloading I start evolution and i check my mail(my mbox is very large, > > several hundered megabytes), I close evolution(I use evolution just to > > have another application witch uses the filesystem and the memory), I > > start evolution again. I start firefox. The download is complete. > > Rtorrent says if the hash is good or not. I do a "unrar t qwe.rar" to > > test that all 84 downloaded rar files are ok and see the result. > > > > > > > > The point of my question is to check that the data is really incorrect > > > on-disk, or whether it is incorrect in pagecache. 
I rebooted and the files are still broken after reboot (tested twice), so the data is incorrect on disk. > > > > > > Also, it'd be useful if you could determine whether the bug appears with > > > the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with > > > rootfstype=ext2 if it's the root filesystem. > > > > I will test. Will test in a couple of hours, I have some work to do... > > ok, thanks. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 1:54 ` Andrew Morton 2006-12-19 2:04 ` Andrei Popa @ 2006-12-19 8:05 ` Andrei Popa 2006-12-19 8:24 ` Andrew Morton 1 sibling, 1 reply; 311+ messages in thread From: Andrei Popa @ 2006-12-19 8:05 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr > > > Also, it'd be useful if you could determine whether the bug appears with > > > the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with > > > rootfstype=ext2 if it's the root filesystem. > > I have file corruption. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 8:05 ` Andrei Popa @ 2006-12-19 8:24 ` Andrew Morton 2006-12-19 8:34 ` Pekka Enberg 2006-12-19 9:13 ` Marc Haber 0 siblings, 2 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-19 8:24 UTC (permalink / raw) To: andrei.popa Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 19 Dec 2006 10:05:03 +0200 Andrei Popa <andrei.popa@i-neo.ro> wrote: > > > > Also, it'd be useful if you could determine whether the bug appears with > > > > the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with > > > > rootfstype=ext2 if it's the root filesystem. > > > > I have file corruption. Wow. I didn't expect that, because Marc Haber reported that ext3's data=writeback fixed it. Maybe he didn't run it for long enough? ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 8:24 ` Andrew Morton @ 2006-12-19 8:34 ` Pekka Enberg 2006-12-19 9:13 ` Marc Haber 1 sibling, 0 replies; 311+ messages in thread From: Pekka Enberg @ 2006-12-19 8:34 UTC (permalink / raw) To: Andrew Morton Cc: andrei.popa, Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On 12/19/06, Andrew Morton <akpm@osdl.org> wrote: > Wow. I didn't expect that, because Marc Haber reported that ext3's data=writeback > fixed it. Maybe he didn't run it for long enough? I don't think it did fix it for Marc: http://marc.theaimsgroup.com/?l=linux-kernel&m=116625777306843&w=2 ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 8:24 ` Andrew Morton 2006-12-19 8:34 ` Pekka Enberg @ 2006-12-19 9:13 ` Marc Haber 1 sibling, 0 replies; 311+ messages in thread From: Marc Haber @ 2006-12-19 9:13 UTC (permalink / raw) To: Andrew Morton Cc: andrei.popa, Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Martin Michlmayr On Tue, Dec 19, 2006 at 12:24:16AM -0800, Andrew Morton wrote: > Wow. I didn't expect that, because Mark Haber reported that ext3's data=writeback > fixed it. Maybe he didn't run it for long enough? My test case is Debian's "aptitude update" running once an hour, and it was always the same file getting corrupted. With 2.6.19, I had this corruption like every third hour (but -only- if run from cron, running from a shell was always fine), data=writeback made the issue disappear for about two days before I booted into 2.6.19.1 without data=writeback (defaults chosen then), after which the issue only shows up like every other day. So, I feel like out of the loop since rtorrent seems much better in reproducing this. I notice, though, that both aptitude and rtorrent do downloads from the net, so there might be a relation to tcp/ip and/or the network driver. My box has a Linksys NC100 network card running with the tulip driver. Greetings Marc -- ----------------------------------------------------------------------------- Marc Haber | "I don't trust Computers. They | Mailadresse im Header Mannheim, Germany | lose things." Winona Ryder | Fon: *49 621 72739834 Nordisch by Nature | How to make an American Quilt | Fax: *49 621 72739835 ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 0:57 ` Linus Torvalds 2006-12-19 1:21 ` Andrew Morton @ 2006-12-19 1:50 ` Andrei Popa 1 sibling, 0 replies; 311+ messages in thread From: Andrei Popa @ 2006-12-19 1:50 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 2006-12-18 at 16:57 -0800, Linus Torvalds wrote: > > On Tue, 19 Dec 2006, Andrei Popa wrote: > > > > > > > > nope, no file corruption at all. > > > > > > Ok. That's interesting, but I think you actually #ifdef'ed out too > > > much: > > > > > > It was really just the _inner_ "if (mapping_cap_account_dirty(.." > > > statement that I meant you should remove. > > > > > > Can you try that too? > > > > I have file corruption: "Hash check on download completion found bad > > chunks, consider using "safe_sync"." > > Ok, that's interesting. > > So it doesn't seem to be the call to page_mkclean() itself that causes > corruption. It looks like Peter's hunch that maybe there's some bug in > PG_dirty handling _itself_ might be an idea.. > > And the reason it only started happening now is that it may just have been > _hidden_ by the fact that while we kept the dirty bits in the page tables, > we'd end up writing the dirty page _despite_ having lost the PG_dirty bit. > So if it's some bad interaction between writable mappings and some other > part of the system, we just didn't see it earlier, exactly because we had > _lots_ of dirty bits, and it was enough that _one_ of them was right. > > If you didn't see corruption when you #ifdef'ed out too much of the > "test_clean_page_dirty() function (the _whole_ TestClearPageDirty() > if-statement), but you get it when you just comment out the stuff that > does the page_mkclean(), that's interesting. > > I'm left lookin gat the "radix_tree_tag_clear()" in > test_clear_page_dirty(). > > What happens if you only ifdef out that single thing? 
I have file corruption. > > The actual page-cleaning functions make sure to only clear the TAG_DIRTY > bit _after_ the page has been marked for writeback. Is there some ordering > constraint there, perhaps? > > I'm really reaching here. I'm trying to see the pattern, and I'm not > seeing it. I'm asking you to test things just to get more of a feel for > what triggers the failure, than because I actually have any kind of idea > of what the heck is going on. > > Andrew, Nick, Hugh - any ideas? > > Linus diff --git a/fs/buffer.c b/fs/buffer.c index d1f1b54..263f88e 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? */ @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* - * If the filesystem writes its buffers by hand (eg ext3) - * then we can have clean buffers against a dirty page. We - * clean the page here; otherwise later reattachment of buffers - * could encounter a non-uptodate page, which is unresolvable. - * This only applies in the rare case where try_to_free_buffers - * succeeds but the page is not freed. - * - * Also, during truncate, discard_buffer will have marked all - * the page's buffers clean. We discover that here and clean - * the page also. 
- */ - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff --git a/fs/cifs/file.c b/fs/cifs/file.c index 0f05cab..2d8bbbb 100644 --- a/fs/cifs/file.c +++ b/fs/cifs/file.c @@ -1245,7 +1245,7 @@ retry: wait_on_page_writeback(page); if (PageWriteback(page) || - !test_clear_page_dirty(page)) { + !test_clear_page_dirty(page, 0)) { unlock_page(page); break; } diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 1387749..da2bdb1 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -484,7 +484,7 @@ static int fuse_commit_write(struct file spin_unlock(&fc->lock); if (offset == 0 && to == PAGE_CACHE_SIZE) { - clear_page_dirty(page); + clear_page_dirty(page, 0); SetPageUptodate(page); } } diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index ed2c223..9f82cd0 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + clear_page_dirty(page, 0); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c index b1a1c72..5e29b37 100644 --- a/fs/jfs/jfs_metapage.c +++ b/fs/jfs/jfs_metapage.c @@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1 /* Retest mp->count since we may have released page lock */ if (test_bit(META_discard, &mp->flag) && !mp->count) { - clear_page_dirty(page); + clear_page_dirty(page, 0); ClearPageUptodate(page); } #else diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c index 47e7027..a97e198 100644 --- a/fs/reiserfs/stree.c +++ b/fs/reiserfs/stree.c @@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p bh = next; } while (bh != head); if (PAGE_SIZE == bh->b_size) { - clear_page_dirty(page); + clear_page_dirty(page, 0); } } } diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c index b56eb75..44ac434 100644 --- 
a/fs/xfs/linux-2.6/xfs_aops.c +++ b/fs/xfs/linux-2.6/xfs_aops.c @@ -343,7 +343,7 @@ xfs_start_page_writeback( ASSERT(!PageWriteback(page)); set_page_writeback(page); if (clear_dirty) - clear_page_dirty(page); + clear_page_dirty(page, 0); unlock_page(page); if (!buffers) { end_page_writeback(page); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 4830a3b..175ab3c 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -253,13 +253,13 @@ #define ClearPageUncached(page) clear_bi struct page; /* forward declaration */ -int test_clear_page_dirty(struct page *page); +int test_clear_page_dirty(struct page *page, int must_clean_ptes); int test_clear_page_writeback(struct page *page); int test_set_page_writeback(struct page *page); -static inline void clear_page_dirty(struct page *page) +static inline void clear_page_dirty(struct page *page, int must_clean_ptes) { - test_clear_page_dirty(page); + test_clear_page_dirty(page, must_clean_ptes); } static inline void set_page_writeback(struct page *page) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 237107c..4ff7f90 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock); * Clear a page's dirty flag, while caring for dirty memory accounting. * Returns true if the page was previously dirty. 
*/ -int test_clear_page_dirty(struct page *page) +int test_clear_page_dirty(struct page *page, int must_clean_ptes) { struct address_space *mapping = page_mapping(page); unsigned long flags; @@ -857,6 +857,7 @@ int test_clear_page_dirty(struct page *p return TestClearPageDirty(page); write_lock_irqsave(&mapping->tree_lock, flags); + if (TestClearPageDirty(page)) { radix_tree_tag_clear(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); @@ -865,12 +866,23 @@ int test_clear_page_dirty(struct page *p * We can continue to use `mapping' here because the * page is locked, which pins the address_space */ + +#if 0 + if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); + int cleaned = page_mkclean(page); + if (!must_clean_ptes && cleaned){ + WARN_ON(1); + set_page_dirty(page); + } + dec_zone_page_state(page, NR_FILE_DIRTY); } +#endif + return 1; } + write_unlock_irqrestore(&mapping->tree_lock, flags); return 0; } diff --git a/mm/rmap.c b/mm/rmap.c diff --git a/mm/truncate.c b/mm/truncate.c index 9bfb8e8..9a01d9e 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp if (PagePrivate(page)) do_invalidatepage(page, 0); - if (test_clear_page_dirty(page)) + if (test_clear_page_dirty(page, 0)) task_io_account_cancelled_write(PAGE_CACHE_SIZE); ClearPageUptodate(page); ClearPageMappedToDisk(page); @@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct PAGE_CACHE_SIZE, 0); } } - was_dirty = test_clear_page_dirty(page); + was_dirty = test_clear_page_dirty(page, 0); if (!invalidate_complete_page2(mapping, page)) { if (was_dirty) set_page_dirty(page); diff --git a/fs/buffer.c b/fs/buffer.c index d1f1b54..263f88e 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? 
*/ @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* - * If the filesystem writes its buffers by hand (eg ext3) - * then we can have clean buffers against a dirty page. We - * clean the page here; otherwise later reattachment of buffers - * could encounter a non-uptodate page, which is unresolvable. - * This only applies in the rare case where try_to_free_buffers - * succeeds but the page is not freed. - * - * Also, during truncate, discard_buffer will have marked all - * the page's buffers clean. We discover that here and clean - * the page also. - */ - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff --git a/fs/cifs/file.c b/fs/cifs/file.c index 0f05cab..2d8bbbb 100644 --- a/fs/cifs/file.c +++ b/fs/cifs/file.c @@ -1245,7 +1245,7 @@ retry: wait_on_page_writeback(page); if (PageWriteback(page) || - !test_clear_page_dirty(page)) { + !test_clear_page_dirty(page, 0)) { unlock_page(page); break; } diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 1387749..da2bdb1 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -484,7 +484,7 @@ static int fuse_commit_write(struct file spin_unlock(&fc->lock); if (offset == 0 && to == PAGE_CACHE_SIZE) { - clear_page_dirty(page); + clear_page_dirty(page, 0); SetPageUptodate(page); } } diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index ed2c223..9f82cd0 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + clear_page_dirty(page, 0); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c index b1a1c72..5e29b37 100644 --- a/fs/jfs/jfs_metapage.c +++ 
b/fs/jfs/jfs_metapage.c @@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1 /* Retest mp->count since we may have released page lock */ if (test_bit(META_discard, &mp->flag) && !mp->count) { - clear_page_dirty(page); + clear_page_dirty(page, 0); ClearPageUptodate(page); } #else diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c index 47e7027..a97e198 100644 --- a/fs/reiserfs/stree.c +++ b/fs/reiserfs/stree.c @@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p bh = next; } while (bh != head); if (PAGE_SIZE == bh->b_size) { - clear_page_dirty(page); + clear_page_dirty(page, 0); } } } diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c index b56eb75..44ac434 100644 --- a/fs/xfs/linux-2.6/xfs_aops.c +++ b/fs/xfs/linux-2.6/xfs_aops.c @@ -343,7 +343,7 @@ xfs_start_page_writeback( ASSERT(!PageWriteback(page)); set_page_writeback(page); if (clear_dirty) - clear_page_dirty(page); + clear_page_dirty(page, 0); unlock_page(page); if (!buffers) { end_page_writeback(page); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 4830a3b..175ab3c 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -253,13 +253,13 @@ #define ClearPageUncached(page) clear_bi struct page; /* forward declaration */ -int test_clear_page_dirty(struct page *page); +int test_clear_page_dirty(struct page *page, int must_clean_ptes); int test_clear_page_writeback(struct page *page); int test_set_page_writeback(struct page *page); -static inline void clear_page_dirty(struct page *page) +static inline void clear_page_dirty(struct page *page, int must_clean_ptes) { - test_clear_page_dirty(page); + test_clear_page_dirty(page, must_clean_ptes); } static inline void set_page_writeback(struct page *page) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 237107c..e6524a6 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock); * Clear a page's dirty flag, while caring for dirty 
memory accounting. * Returns true if the page was previously dirty. */ -int test_clear_page_dirty(struct page *page) +int test_clear_page_dirty(struct page *page, int must_clean_ptes) { struct address_space *mapping = page_mapping(page); unsigned long flags; @@ -857,20 +857,35 @@ int test_clear_page_dirty(struct page *p return TestClearPageDirty(page); write_lock_irqsave(&mapping->tree_lock, flags); + if (TestClearPageDirty(page)) { + +#if 0 + radix_tree_tag_clear(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); + +#endif + write_unlock_irqrestore(&mapping->tree_lock, flags); /* * We can continue to use `mapping' here because the * page is locked, which pins the address_space */ + + if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); + int cleaned = page_mkclean(page); + if (!must_clean_ptes && cleaned){ + WARN_ON(1); + set_page_dirty(page); + } + dec_zone_page_state(page, NR_FILE_DIRTY); } return 1; } + write_unlock_irqrestore(&mapping->tree_lock, flags); return 0; } diff --git a/mm/rmap.c b/mm/rmap.c diff --git a/mm/truncate.c b/mm/truncate.c index 9bfb8e8..9a01d9e 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp if (PagePrivate(page)) do_invalidatepage(page, 0); - if (test_clear_page_dirty(page)) + if (test_clear_page_dirty(page, 0)) task_io_account_cancelled_write(PAGE_CACHE_SIZE); ClearPageUptodate(page); ClearPageMappedToDisk(page); @@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct PAGE_CACHE_SIZE, 0); } } - was_dirty = test_clear_page_dirty(page); + was_dirty = test_clear_page_dirty(page, 0); if (!invalidate_complete_page2(mapping, page)) { if (was_dirty) set_page_dirty(page); ^ permalink raw reply related [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 23:48 ` Andrei Popa 2006-12-19 0:04 ` Linus Torvalds @ 2006-12-19 1:03 ` Gene Heskett 1 sibling, 0 replies; 311+ messages in thread From: Gene Heskett @ 2006-12-19 1:03 UTC (permalink / raw) To: linux-kernel, andrei.popa Cc: Linus Torvalds, Peter Zijlstra, Andrew Morton, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Monday 18 December 2006 18:48, Andrei Popa wrote: >On Mon, 2006-12-18 at 14:32 -0800, Linus Torvalds wrote: >> On Mon, 18 Dec 2006, Andrei Popa wrote: >> > > This should be fairly easy to test: just change every single ", 1" >> > > case in the patch to ", 0". >> > > >> > > What happens for you in that case? >> > >> > I have file corruption. >> >> Magic. And btw, _thanks_ for being such a great tester. >> >> So now I have one more thng for you to try, it you can bother: >> >> There's exactly two call sites that call "page_mkclean()" (an dthat is >> the only thing in turn that calls "page_mkclean_one()", which we >> already determined will cause the corruption). >> >> Both of them do >> >> if (mapping_cap_account_dirty(mapping)) { >> .. >> >> things, although they do slightly different things inside that if in >> your patched kernel. >> >> Can you just TOTALLY DISABLE that case for the test_clear_page_dirty() >> case? Just do an "#if 0 .. #endif" around that whole if-statement, >> leaving the _only_ thing that actually calls "page_mkclean()" to be >> the "clear_page_dirty_for_io()" call. >> >> Do you still see corruption? > >nope, no file corruption at all. > Goody I says to nobody in particular, I'll go build this... > >diff --git a/fs/buffer.c b/fs/buffer.c >index d1f1b54..263f88e 100644 >--- a/fs/buffer.c >+++ b/fs/buffer.c >@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag > int ret = 0; > > BUG_ON(!PageLocked(page)); >- if (PageWriteback(page)) >+ if (PageDirty(page) || PageWriteback(page)) > return 0; > > if (mapping == NULL) { /* can this still happen? 
*/ >@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag > spin_lock(&mapping->private_lock); > ret = drop_buffers(page, &buffers_to_free); > spin_unlock(&mapping->private_lock); >- if (ret) { >- /* >- * If the filesystem writes its buffers by hand (eg ext3) >- * then we can have clean buffers against a dirty page. We >- * clean the page here; otherwise later reattachment of buffers >- * could encounter a non-uptodate page, which is unresolvable. >- * This only applies in the rare case where try_to_free_buffers >- * succeeds but the page is not freed. >- * >- * Also, during truncate, discard_buffer will have marked all >- * the page's buffers clean. We discover that here and clean >- * the page also. >- */ >- if (test_clear_page_dirty(page)) >- task_io_account_cancelled_write(PAGE_CACHE_SIZE); >- } > out: > if (buffers_to_free) { > struct buffer_head *bh = buffers_to_free; >diff --git a/fs/cifs/file.c b/fs/cifs/file.c >index 0f05cab..2d8bbbb 100644 >--- a/fs/cifs/file.c >+++ b/fs/cifs/file.c >@@ -1245,7 +1245,7 @@ retry: > wait_on_page_writeback(page); > > if (PageWriteback(page) || >- !test_clear_page_dirty(page)) { >+ !test_clear_page_dirty(page, 0)) { > unlock_page(page); > break; > } >diff --git a/fs/fuse/file.c b/fs/fuse/file.c >index 1387749..da2bdb1 100644 >--- a/fs/fuse/file.c >+++ b/fs/fuse/file.c >@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file > spin_unlock(&fc->lock); > > if (offset == 0 && to == PAGE_CACHE_SIZE) { >- clear_page_dirty(page); >+ clear_page_dirty(page, 0); > SetPageUptodate(page); > } > } >diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >index ed2c223..9f82cd0 100644 >--- a/fs/hugetlbfs/inode.c >+++ b/fs/hugetlbfs/inode.c >@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct > > static void truncate_huge_page(struct page *page) > { >- clear_page_dirty(page); >+ clear_page_dirty(page, 0); > ClearPageUptodate(page); > remove_from_page_cache(page); > put_page(page); >diff --git a/fs/jfs/jfs_metapage.c 
b/fs/jfs/jfs_metapage.c >index b1a1c72..5e29b37 100644 >--- a/fs/jfs/jfs_metapage.c >+++ b/fs/jfs/jfs_metapage.c >@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1 > > /* Retest mp->count since we may have released page lock */ > if (test_bit(META_discard, &mp->flag) && !mp->count) { >- clear_page_dirty(page); >+ clear_page_dirty(page, 0); > ClearPageUptodate(page); > } > #else >diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c >index 47e7027..a97e198 100644 >--- a/fs/reiserfs/stree.c >+++ b/fs/reiserfs/stree.c >@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p > bh = next; > } while (bh != head); > if (PAGE_SIZE == bh->b_size) { >- clear_page_dirty(page); >+ clear_page_dirty(page, 0); > } > } > } >diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c >index b56eb75..44ac434 100644 >--- a/fs/xfs/linux-2.6/xfs_aops.c >+++ b/fs/xfs/linux-2.6/xfs_aops.c >@@ -343,7 +343,7 @@ xfs_start_page_writeback( > ASSERT(!PageWriteback(page)); > set_page_writeback(page); > if (clear_dirty) >- clear_page_dirty(page); >+ clear_page_dirty(page, 0); > unlock_page(page); > if (!buffers) { > end_page_writeback(page); >diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h >index 4830a3b..175ab3c 100644 >--- a/include/linux/page-flags.h >+++ b/include/linux/page-flags.h >@@ -253,13 +253,13 @@ #define ClearPageUncached(page) clear_bi > > struct page; /* forward declaration */ > >-int test_clear_page_dirty(struct page *page); >+int test_clear_page_dirty(struct page *page, int must_clean_ptes); > int test_clear_page_writeback(struct page *page); > int test_set_page_writeback(struct page *page); > >-static inline void clear_page_dirty(struct page *page) >+static inline void clear_page_dirty(struct page *page, int >must_clean_ptes) above looks wrapped to me so I fixed it to one line > { >- test_clear_page_dirty(page); >+ test_clear_page_dirty(page, must_clean_ptes); > } > > static inline void set_page_writeback(struct page *page) >diff --git 
a/mm/page-writeback.c b/mm/page-writeback.c >index 237107c..f2a157d 100644 >--- a/mm/page-writeback.c >+++ b/mm/page-writeback.c >@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock); > * Clear a page's dirty flag, while caring for dirty memory >accounting. Likewise here, malformed patch otherwise > * Returns true if the page was previously dirty. > */ >-int test_clear_page_dirty(struct page *page) >+int test_clear_page_dirty(struct page *page, int must_clean_ptes) > { > struct address_space *mapping = page_mapping(page); > unsigned long flags; >@@ -857,6 +857,8 @@ int test_clear_page_dirty(struct page *p > return TestClearPageDirty(page); > > write_lock_irqsave(&mapping->tree_lock, flags); >+ >+#if 0 > if (TestClearPageDirty(page)) { > radix_tree_tag_clear(&mapping->page_tree, > page_index(page), PAGECACHE_TAG_DIRTY); >@@ -866,11 +868,19 @@ int test_clear_page_dirty(struct page *p > * page is locked, which pins the address_space > */ > if (mapping_cap_account_dirty(mapping)) { >- page_mkclean(page); >+ int cleaned = page_mkclean(page); >+ if (!must_clean_ptes && cleaned){ >+ WARN_ON(1); >+ set_page_dirty(page); >+ } >+ > dec_zone_page_state(page, NR_FILE_DIRTY); > } > return 1; > } >+ >+#endif >+ > write_unlock_irqrestore(&mapping->tree_lock, flags); > return 0; > } >diff --git a/mm/rmap.c b/mm/rmap.c >diff --git a/mm/truncate.c b/mm/truncate.c >index 9bfb8e8..9a01d9e 100644 >--- a/mm/truncate.c >+++ b/mm/truncate.c >@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp > if (PagePrivate(page)) > do_invalidatepage(page, 0); > >- if (test_clear_page_dirty(page)) >+ if (test_clear_page_dirty(page, 0)) > task_io_account_cancelled_write(PAGE_CACHE_SIZE); > ClearPageUptodate(page); > ClearPageMappedToDisk(page); >@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct > PAGE_CACHE_SIZE, 0); > } > } >- was_dirty = test_clear_page_dirty(page); >+ was_dirty = test_clear_page_dirty(page, 0); > if (!invalidate_complete_page2(mapping, page)) { > if (was_dirty) > 
set_page_dirty(page); > I think I must have screwed the moose. Following along in this thread, I'd patched things back and forth till I figured I'd better do a fresh tree, so starting with the full 2.6.19 tarball, I applied the 2.6.20-rc1 patch, then the above patch, which should be the only thing different from what I'm running right now, which is the commented line in rmap.c, otherwise as it unpacked. But: In file included from include/linux/mm.h:230, from include/linux/rmap.h:10, from init/main.c:47: include/linux/page-flags.h:260: error: expected declaration specifiers or ‘...’ before ‘in’ include/linux/page-flags.h: In function ‘clear_page_dirty’: include/linux/page-flags.h:262: error: ‘must_clean_ptes’ undeclared (first use in this function) include/linux/page-flags.h:262: error: (Each undeclared identifier is reported only once include/linux/page-flags.h:262: error: for each function it appears in.) make[1]: *** [init/main.o] Error 1 make: *** [init] Error 2 There were 2 places where this patch is word wrapped, and this was one of them: -static inline void clear_page_dirty(struct page *page) +static inline void clear_page_dirty(struct page *page, int must_clean_ptes) The other one was in a comment, which screwed the patch and needed fixed too. Is it fubared someplace else I missed? Or am I in fact being bitten by this bug? -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Yahoo.com and AOL/TW attorneys please note, additions to the above message by Gene Heskett are: Copyright 2006 by Maurice Eugene Heskett, all rights reserved. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 20:41 ` Linus Torvalds 2006-12-18 21:11 ` Andrei Popa @ 2006-12-18 22:34 ` Gene Heskett 2006-12-22 17:27 ` Linus Torvalds 1 sibling, 1 reply; 311+ messages in thread From: Gene Heskett @ 2006-12-18 22:34 UTC (permalink / raw) To: linux-kernel Cc: Linus Torvalds, Andrei Popa, Peter Zijlstra, Andrew Morton, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Monday 18 December 2006 15:41, Linus Torvalds wrote: >On Mon, 18 Dec 2006, Linus Torvalds wrote: >> But at the same time, it's interesting that it still happens when we >> try to re-add the dirty bit. That would tell me that it's one of two >> cases: > >Forget that. There's a third case, which is much more likely: > > - Andrew's patch had a ", 1" where it _should_ have had a ", 0". > >This should be fairly easy to test: just change every single ", 1" case > in the patch to ", 0". > >The only case that _definitely_ would want ",1" is actually the case > that already calls page_mkclean() directly: clear_page_dirty_for_io(). > So no other ", 1" is valid, and that one that needed it already avoided > even calling the "test_clear_page_dirty()" function, because it did it > all by hand. > What about the mm/rmap.c one liner, in or out? Thanks. >What happens for you in that case? > > Linus >- >To unsubscribe from this list: send the line "unsubscribe linux-kernel" > in the body of a message to majordomo@vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html >Please read the FAQ at http://www.tux.org/lkml/ -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) Yahoo.com and AOL/TW attorneys please note, additions to the above message by Gene Heskett are: Copyright 2006 by Maurice Eugene Heskett, all rights reserved. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 22:34 ` Gene Heskett @ 2006-12-22 17:27 ` Linus Torvalds 0 siblings, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-22 17:27 UTC (permalink / raw) To: Gene Heskett Cc: linux-kernel, Andrei Popa, Peter Zijlstra, Andrew Morton, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 18 Dec 2006, Gene Heskett wrote:
>
> What about the mm/rmap.c one liner, in or out?

The one that just removes the "pte_mkclean()"? That's definitely out, it
was just a test-patch to verify that the pte dirty bits seemed to matter
at all (and they do).

		Linus

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 20:14 ` Linus Torvalds 2006-12-18 20:41 ` Linus Torvalds @ 2006-12-18 21:43 ` Andrew Morton 2006-12-18 21:49 ` Peter Zijlstra 2006-12-19 23:42 ` Peter Zijlstra 3 siblings, 0 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-18 21:43 UTC (permalink / raw) To: Linus Torvalds Cc: Andrei Popa, Peter Zijlstra, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 18 Dec 2006 12:14:35 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote:

> OR:
>
>  - page_mkclean_one() is simply buggy.
>
> And I'm starting to wonder about the second case. But it all LOOKS really
> fine - I can't see anything wrong there (it uses the extremely
> conservative "ptep_get_and_clear()", and seems to flush everything right
> too, through "ptep_establish()").

What does the call to page_check_address() in there do? It'd be good to
have a printk in there to see if it's triggering.

Is this all correct for non-linear VMAs? (rtorrent doesn't use
MAP_NONLINEAR though).

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 20:14 ` Linus Torvalds 2006-12-18 20:41 ` Linus Torvalds 2006-12-18 21:43 ` Andrew Morton @ 2006-12-18 21:49 ` Peter Zijlstra 2006-12-19 23:42 ` Peter Zijlstra 3 siblings, 0 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-18 21:49 UTC (permalink / raw) To: Linus Torvalds Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
>
> On Mon, 18 Dec 2006, Andrei Popa wrote:
> >
> > I dropped that patch and added WARN_ON(1), the unified patch is
> > attached.
> >
> > I got corruption: "Hash check on download completion found bad chunks,
> > consider using "safe_sync"."
>
> Ok. That is actually _very_ interesting.
>
> It's interesting because (a) the corruption obviously goes away with the
> one-liner that effectively disables "page_mkclean_one()".
>
> So that tells us that yes, it's a PTE dirty bit that matters.
>
> But at the same time, it's interesting that it still happens when we try
> to re-add the dirty bit. That would tell me that it's one of two cases:
>
>  - there is another caller of page cleaning that should have done the same
>    thing (we could check that by just doing this all _inside_ the
>    page_mkclean() thing)
>
> OR:
>
>  - page_mkclean_one() is simply buggy.
>
> And I'm starting to wonder about the second case. But it all LOOKS really
> fine - I can't see anything wrong there (it uses the extremely
> conservative "ptep_get_and_clear()", and seems to flush everything right
> too, through "ptep_establish()").

How about this: we get confused on what PG_dirty tells us, we fall back
to pte_dirty, transfer pte_dirty to PG_dirty and clear pte_dirty. Now it
happens again, however we don't have pte_dirty to fall back to anymore.

This would explain why disabling pte_mkclean() does make it go away and
none of the other tried approaches work.

We really need a way to sort out PG_dirty, independent of pte_dirty.

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 20:14 ` Linus Torvalds ` (2 preceding siblings ...) 2006-12-18 21:49 ` Peter Zijlstra @ 2006-12-19 23:42 ` Peter Zijlstra 2006-12-20 0:23 ` Linus Torvalds 2006-12-20 14:15 ` Andrei Popa 3 siblings, 2 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-19 23:42 UTC (permalink / raw) To: Linus Torvalds Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote: > OR: > > - page_mkclean_one() is simply buggy. GOLD! it seems to work with all this (full diff against current git). /me rebuilds full kernel to make sure... reboot... test... pff the tension... yay, still good! Andrei; would you please verify. The magic seems to be in the extra tlb flush after clearing the dirty bit. Just too bad ptep_clear_flush_dirty() needs ptep not entry. diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c index 5e7cd45..2b8893b 100644 --- a/drivers/connector/connector.c +++ b/drivers/connector/connector.c @@ -135,8 +135,7 @@ static int cn_call_callback(struct cn_msg *msg, void (*destruct_data)(void *), v spin_lock_bh(&dev->cbdev->queue_lock); list_for_each_entry(__cbq, &dev->cbdev->queue_list, callback_entry) { if (cn_cb_equal(&__cbq->id.id, &msg->id)) { - if (likely(!test_bit(WORK_STRUCT_PENDING, - &__cbq->work.work.management) && + if (likely(!delayed_work_pending(&__cbq->work) && __cbq->data.ddata == NULL)) { __cbq->data.callback_priv = msg; diff --git a/fs/buffer.c b/fs/buffer.c index d1f1b54..263f88e 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page) int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? 
*/ @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page) spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* - * If the filesystem writes its buffers by hand (eg ext3) - * then we can have clean buffers against a dirty page. We - * clean the page here; otherwise later reattachment of buffers - * could encounter a non-uptodate page, which is unresolvable. - * This only applies in the rare case where try_to_free_buffers - * succeeds but the page is not freed. - * - * Also, during truncate, discard_buffer will have marked all - * the page's buffers clean. We discover that here and clean - * the page also. - */ - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff --git a/mm/memory.c b/mm/memory.c index c00bac6..60e0945 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping, } EXPORT_SYMBOL(unmap_mapping_range); +static void check_last_page(struct address_space *mapping, loff_t size) +{ + pgoff_t index; + unsigned int offset; + struct page *page; + + if (!mapping) + return; + offset = size & ~PAGE_MASK; + if (!offset) + return; + index = size >> PAGE_SHIFT; + page = find_lock_page(mapping, index); + if (page) { + unsigned int check = 0; + unsigned char *kaddr = kmap_atomic(page, KM_USER0); + do { + check += kaddr[offset++]; + } while (offset < PAGE_SIZE); + kunmap_atomic(kaddr, KM_USER0); + unlock_page(page); + page_cache_release(page); + if (check) + printk(KERN_ERR "%s: BADNESS: truncate check %u\n", current->comm, check); + } +} + /** * vmtruncate - unmap mappings "freed" by truncate() syscall * @inode: inode of the file used @@ -1875,6 +1902,7 @@ do_expand: goto out_sig; if (offset > inode->i_sb->s_maxbytes) goto out_big; + check_last_page(mapping, inode->i_size); i_size_write(inode, offset); 
out_truncate: diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 237107c..f561e72 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -957,7 +957,7 @@ int test_set_page_writeback(struct page *page) EXPORT_SYMBOL(test_set_page_writeback); /* - * Return true if any of the pages in the mapping are marged with the + * Return true if any of the pages in the mapping are marked with the * passed tag. */ int mapping_tagged(struct address_space *mapping, int tag) diff --git a/mm/rmap.c b/mm/rmap.c index d8a842a..900229a 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) { struct mm_struct *mm = vma->vm_mm; unsigned long address; - pte_t *pte, entry; + pte_t *ptep, entry; spinlock_t *ptl; int ret = 0; @@ -440,22 +440,23 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl); - if (!pte) + ptep = page_check_address(page, mm, address, &ptl); + if (!ptep) goto out; - if (!pte_dirty(*pte) && !pte_write(*pte)) + if (!pte_dirty(*ptep) && !pte_write(*ptep)) goto unlock; - entry = ptep_get_and_clear(mm, address, pte); - entry = pte_mkclean(entry); + entry = ptep_get_and_clear(mm, address, ptep); entry = pte_wrprotect(entry); - ptep_establish(vma, address, pte, entry); + ptep_establish(vma, address, ptep, entry); + ret = ptep_clear_flush_dirty(vma, address, ptep) || + page_test_and_clear_dirty(page); lazy_mmu_prot_update(entry); ret = 1; unlock: - pte_unmap_unlock(pte, ptl); + pte_unmap_unlock(ptep, ptl); out: return ret; } ^ permalink raw reply related [flat|nested] 311+ messages in thread
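The check_last_page() debugging helper in the diff above sums every byte of the final page that lies beyond i_size and complains if the sum is non-zero. A minimal user-space sketch of the same arithmetic follows, assuming a 4 KiB page; the names SKETCH_PAGE_SIZE and tail_checksum() are hypothetical and only illustrate the offset calculation, not the page-cache lookup or locking done in the kernel version.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/*
 * Sum the bytes of the last page that lie past the end of the file.
 * "page" points at the SKETCH_PAGE_SIZE bytes of the page containing
 * file offset "size". Returns 0 when the file ends exactly on a page
 * boundary or when the tail is all zeroes.
 */
static unsigned int tail_checksum(const unsigned char *page, uint64_t size)
{
	/* same value as "size & ~PAGE_MASK" in the patch */
	unsigned int offset = (unsigned int)(size & (SKETCH_PAGE_SIZE - 1));
	unsigned int check = 0;

	assert(page);
	if (!offset)
		return 0;
	while (offset < SKETCH_PAGE_SIZE)
		check += page[offset++];
	return check;
}
```

A non-zero return corresponds to the "BADNESS: truncate check" printk in the patch: data is sitting in the part of the page that should read back as zeroes once the file is extended.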
* Re: 2.6.19 file content corruption on ext3 2006-12-19 23:42 ` Peter Zijlstra @ 2006-12-20 0:23 ` Linus Torvalds 2006-12-20 9:01 ` Peter Zijlstra 2006-12-20 9:32 ` Peter Zijlstra 2006-12-20 14:15 ` Andrei Popa 1 sibling, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-20 0:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > OR:
> >
> >  - page_mkclean_one() is simply buggy.
>
> GOLD!

Ok. I was looking at that, and I wondered..

However, if that works, then I _think_ the correct sequence is the
following..

The rule should be:
 - we flush the tlb _after_ we have cleared it, but _before_ we insert the
   new entry.

But I dunno. These things are damn subtle. Does this patch fix it for you?

I actually suspect we should do this as an arch-specific macro, and
totally replace the current "ptep_clear_flush_dirty()" with one that does
"ptep_clear_flush_dirty_and_set_wp()".

Because what I'd _really_ prefer to do on x86 (and probably on most other
sane architectures) is to do

 - atomically replace the pte with the EXACT SAME ONE, but one that
   has the writable bit clear.

	bit_clear(_PAGE_BIT_RW, &(ptep)->pte_low);

 - flush the TLB, making sure that all CPU's will no longer write to it:

	flush_tlb_page(vma, address);

 - finally, just fetch-and-clear the dirty bit (and since it's no longer
   writable, nobody should be setting it any more)

	ret = bit_clear(__PAGE_BIT_DIRTY, &(ptep)->pte_low);

and now we should be all done.

But the "ptep_get_and_clear() + flush_tlb_page()" sequence should
hopefully also work.

Pls test.

		Linus

----
diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..eec8706 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,9 +448,10 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 		goto unlock;
 
 	entry = ptep_get_and_clear(mm, address, pte);
+	flush_tlb_page(vma, address);
 	entry = pte_mkclean(entry);
 	entry = pte_wrprotect(entry);
-	ptep_establish(vma, address, pte, entry);
+	set_pte_at(mm, address, pte, entry);
 	lazy_mmu_prot_update(entry);
 	ret = 1;
 
^ permalink raw reply related [flat|nested] 311+ messages in thread
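The three-step ordering Linus describes above (write-protect, flush, then fetch-and-clear the dirty bit) can be modelled in user space with C11 atomics. This is only a sketch under stated assumptions: struct sk_pte, sk_flush_tlb_page() and sk_wrprotect_then_clean() are hypothetical names, the "TLB flush" is a counter rather than a real flush, and the bit positions are chosen to mirror the x86 layout although nothing in the sketch depends on that.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SK_PTE_RW	(1ul << 1)	/* illustrative "writable" bit */
#define SK_PTE_DIRTY	(1ul << 6)	/* illustrative "dirty" bit */

/* Toy page-table entry: one atomically updated word. */
struct sk_pte {
	atomic_ulong val;
};

/* Stand-in for flush_tlb_page(): it only counts how often it was called. */
static unsigned int sk_tlb_flushes;

static void sk_flush_tlb_page(void)
{
	sk_tlb_flushes++;
}

/* Returns true if the entry was dirty when it was cleaned. */
static bool sk_wrprotect_then_clean(struct sk_pte *pte)
{
	unsigned long old;

	assert(pte);

	/* 1. Drop the writable bit atomically, leaving all other bits alone. */
	atomic_fetch_and(&pte->val, ~SK_PTE_RW);

	/* 2. Flush, so that no CPU keeps using a stale writable translation. */
	sk_flush_tlb_page();

	/* 3. Only now fetch-and-clear the dirty bit; no writer can set it again. */
	old = atomic_fetch_and(&pte->val, ~SK_PTE_DIRTY);

	return (old & SK_PTE_DIRTY) != 0;
}
```

The point of the ordering is that the dirty bit is sampled after the last possible writer has been shut out, so a store that raced with the write-protect is still seen as dirty instead of being lost.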
* Re: 2.6.19 file content corruption on ext3 2006-12-20 0:23 ` Linus Torvalds @ 2006-12-20 9:01 ` Peter Zijlstra 2006-12-20 9:12 ` Peter Zijlstra ` (2 more replies) 2006-12-20 9:32 ` Peter Zijlstra 1 sibling, 3 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-20 9:01 UTC (permalink / raw) To: Linus Torvalds Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr, Martin Schwidefsky, Heiko Carstens

On Tue, 2006-12-19 at 16:23 -0800, Linus Torvalds wrote:
>
> On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > > OR:
> > >
> > >  - page_mkclean_one() is simply buggy.
> >
> > GOLD!
>
> Ok. I was looking at that, and I wondered..
>
> However, if that works, then I _think_ the correct sequence is the
> following..
>
> The rule should be:
>  - we flush the tlb _after_ we have cleared it, but _before_ we insert the
>    new entry.
>
> But I dunno. These things are damn subtle. Does this patch fix it for you?

I will try, but I had a look around the different architectures
implementation of ptep_clear_flush_dirty() and saw that not all do the
actual flush. So if we go down this road perhaps we should introduce
another per arch function that does the potential flush. like
flush_tlb_on_clear_dirty() or something like that.

Then we could write:

	entry = ptep_get_and_clear(mm, address, ptep);
	flush_tlb_on_clear_dirty(vma, address);
	entry = pte_mkclean(entry);
	entry = pte_wrprotect(entry);
	set_pte_at(mm, address, ptep, entry);

> I actually suspect we should do this as an arch-specific macro, and
> totally replace the current "ptep_clear_flush_dirty()" with one that does
> "ptep_clear_flush_dirty_and_set_wp()".
>
> Because what I'd _really_ prefer to do on x86 (and probably on most other
> sane architectures) is to do
>
>  - atomically replace the pte with the EXACT SAME ONE, but one that
>    has the writable bit clear.
>
> 	bit_clear(_PAGE_BIT_RW, &(ptep)->pte_low);
>
>  - flush the TLB, making sure that all CPU's will no longer write to it:
>
> 	flush_tlb_page(vma, address);
>
>  - finally, just fetch-and-clear the dirty bit (and since it's no longer
>    writable, nobody should be setting it any more)
>
> 	ret = bit_clear(__PAGE_BIT_DIRTY, &(ptep)->pte_low);
>
> and now we should be all done.

Hmm, should we not flush after clearing the dirty bit? That is, why does
ptep_clear_flush_dirty() need a flush after clearing that bit? does it
leak through in the tlb copy?

Also, what is this page_test_and_clear_dirty() business, that seems to be
exclusively s390 btw. However they do seem to need this.

> But the "ptep_get_and_clear() + flush_tlb_page()" sequence should
> hopefully also work.

Yeah, probably, not optimally so on some archs that don't actually need
the flush though. And as above, I wonder about s390.

(added our s390 friends to the CC list)

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-20 9:01 ` Peter Zijlstra @ 2006-12-20 9:12 ` Peter Zijlstra 2006-12-20 9:39 ` Arjan van de Ven 2006-12-20 14:27 ` 2.6.19 file content corruption on ext3 Martin Schwidefsky 2 siblings, 0 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-20 9:12 UTC (permalink / raw) To: Linus Torvalds Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr, Martin Schwidefsky, Heiko Carstens

On Wed, 2006-12-20 at 10:01 +0100, Peter Zijlstra wrote:

> I will try, but I had a look around the different architectures
> implementation of ptep_clear_flush_dirty() and saw that not all do the
> actual flush. So if we go down this road perhaps we should introduce
> another per arch function that does the potential flush. like
> flush_tlb_on_clear_dirty() or something like that.

never mind, we do need an unconditional flush for changing the
protection too.

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-20 9:01 ` Peter Zijlstra 2006-12-20 9:12 ` Peter Zijlstra @ 2006-12-20 9:39 ` Arjan van de Ven 2006-12-20 11:26 ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Peter Zijlstra 2006-12-20 14:27 ` 2.6.19 file content corruption on ext3 Martin Schwidefsky 2 siblings, 1 reply; 311+ messages in thread From: Arjan van de Ven @ 2006-12-20 9:39 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr, Martin Schwidefsky, Heiko Carstens

> Hmm, should we not flush after clearing the dirty bit? That is, why does
> ptep_clear_flush_dirty() need a flush after clearing that bit? does it
> leak through in the tlb copy?

afaics you need to
1) clear
2) flush
3) check and go to 1) if needed

to be race free.

^ permalink raw reply [flat|nested] 311+ messages in thread
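Arjan's clear / flush / check loop above can be sketched as a toy model in user space. Everything here is an assumption made for illustration: struct lp_pte, lp_flush() and lp_clean_until_stable() are hypothetical names, and the racing_writes field simulates a remote CPU that still holds a stale writable translation and dirties the entry before the flush has completed.

```c
#include <assert.h>
#include <stdatomic.h>

#define LP_PTE_RW	(1ul << 1)	/* illustrative "writable" bit */
#define LP_PTE_DIRTY	(1ul << 6)	/* illustrative "dirty" bit */

struct lp_pte {
	atomic_ulong val;
	/* hypothetical: how many flushes are raced by a remote write */
	unsigned int racing_writes;
};

/*
 * Stand-in for a TLB flush. While racing_writes is non-zero, a simulated
 * remote CPU re-dirties the entry just before the flush takes effect.
 */
static void lp_flush(struct lp_pte *pte)
{
	if (pte->racing_writes) {
		pte->racing_writes--;
		atomic_fetch_or(&pte->val, LP_PTE_DIRTY);
	}
}

/* Returns the number of clear+flush passes that were needed. */
static unsigned int lp_clean_until_stable(struct lp_pte *pte)
{
	unsigned int passes = 0;

	assert(pte);
	/* 3) check, and go back to 1) if the entry changed under us */
	while (atomic_load(&pte->val) & (LP_PTE_RW | LP_PTE_DIRTY)) {
		/* 1) clear */
		atomic_fetch_and(&pte->val, ~(LP_PTE_RW | LP_PTE_DIRTY));
		/* 2) flush */
		lp_flush(pte);
		passes++;
	}
	return passes;
}
```

With no racing writer the loop body runs once; each simulated late write costs one extra pass, which is the situation the re-check is meant to catch.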
* [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 9:39 ` Arjan van de Ven @ 2006-12-20 11:26 ` Peter Zijlstra 2006-12-20 11:39 ` Jesper Juhl ` (2 more replies) 0 siblings, 3 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-20 11:26 UTC (permalink / raw) To: Arjan van de Ven Cc: Linus Torvalds, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann

fix page_mkclean_one()

it had several issues:
 - it failed to flush the cache
 - it failed to flush the tlb
 - it failed to do s390 (s390 guys, please verify this is now correct)

Also, clear in a loop to ensure SMP safeness as suggested by Arjan.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/rmap.c |   29 +++++++++++++++--------------
 1 file changed, 15 insertions(+), 14 deletions(-)

Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
-	pte_t *pte, entry;
+	pte_t *ptep;
 	spinlock_t *ptl;
 	int ret = 0;
 
@@ -440,22 +440,23 @@ static int page_mkclean_one(struct page
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl);
-	if (!pte)
+	ptep = page_check_address(page, mm, address, &ptl);
+	if (!ptep)
 		goto out;
 
-	if (!pte_dirty(*pte) && !pte_write(*pte))
-		goto unlock;
-
-	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
-	entry = pte_wrprotect(entry);
-	ptep_establish(vma, address, pte, entry);
-	lazy_mmu_prot_update(entry);
-	ret = 1;
+	while (pte_dirty(*ptep) || pte_write(*ptep)) {
+		pte_t entry = ptep_get_and_clear(mm, address, ptep);
+		flush_cache_page(vma, address, pte_pfn(entry));
+		flush_tlb_page(vma, address);
+		(void)page_test_and_clear_dirty(page); /* do the s390 thing */
+		entry = pte_wrprotect(entry);
+		entry = pte_mkclean(entry);
+		set_pte_at(vma, address, ptep, entry);
+		lazy_mmu_prot_update(entry);
+		ret = 1;
+	}
 
-unlock:
-	pte_unmap_unlock(pte, ptl);
+	pte_unmap_unlock(ptep, ptl);
 out:
 	return ret;
 }

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 11:26 ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Peter Zijlstra @ 2006-12-20 11:39 ` Jesper Juhl 2006-12-20 11:42 ` Peter Zijlstra 2006-12-20 13:00 ` Hugh Dickins 2006-12-20 14:55 ` Martin Schwidefsky 2 siblings, 1 reply; 311+ messages in thread From: Jesper Juhl @ 2006-12-20 11:39 UTC (permalink / raw) To: Peter Zijlstra Cc: Arjan van de Ven, Linus Torvalds, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann On 20/12/06, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > fix page_mkclean_one() > > it had several issues: > - it failed to flush the cache > - it failed to flush the tlb > - it failed to do s390 (s390 guys, please verify this is now correct) > > Also, clear in a loop to ensure SMP safeness as suggested by Arjan. > > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > --- > mm/rmap.c | 29 +++++++++++++++-------------- > 1 file changed, 15 insertions(+), 14 deletions(-) > > Index: linux-2.6/mm/rmap.c > =================================================================== > --- linux-2.6.orig/mm/rmap.c > +++ linux-2.6/mm/rmap.c > @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page > { > struct mm_struct *mm = vma->vm_mm; > unsigned long address; > - pte_t *pte, entry; > + pte_t *ptep; > spinlock_t *ptl; > int ret = 0; > > @@ -440,22 +440,23 @@ static int page_mkclean_one(struct page > if (address == -EFAULT) > goto out; > > - pte = page_check_address(page, mm, address, &ptl); > - if (!pte) > + ptep = page_check_address(page, mm, address, &ptl); > + if (!ptep) > goto out; > > - if (!pte_dirty(*pte) && !pte_write(*pte)) > - goto unlock; > - > - entry = ptep_get_and_clear(mm, address, pte); > - entry = pte_mkclean(entry); > - entry = pte_wrprotect(entry); > - ptep_establish(vma, address, pte, entry); > - 
lazy_mmu_prot_update(entry); > - ret = 1; > + while (pte_dirty(*ptep) || pte_write(*ptep)) { > + pte_t entry = ptep_get_and_clear(mm, address, ptep); > + flush_cache_page(vma, address, pte_pfn(entry)); > + flush_tlb_page(vma, address); > + (void)page_test_and_clear_dirty(page); /* do the s390 thing */ > + entry = pte_wrprotect(entry); > + entry = pte_mkclean(entry); > + set_pte_at(vma, address, ptep, entry); > + lazy_mmu_prot_update(entry); > + ret = 1; > + } > Having the assignment of "ret = 1;" inside the loop seems a little pointless. Perhaps gcc can optimize it, but still, that assignment really only needs to happen once outside the loop. > -unlock: > - pte_unmap_unlock(pte, ptl); > + pte_unmap_unlock(ptep, ptl); > out: > return ret; > } > -- Jesper Juhl <jesper.juhl@gmail.com> Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html Plain text mails only, please http://www.expita.com/nomime.html ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 11:39 ` Jesper Juhl @ 2006-12-20 11:42 ` Peter Zijlstra 2006-12-20 12:12 ` Jesper Juhl 0 siblings, 1 reply; 311+ messages in thread From: Peter Zijlstra @ 2006-12-20 11:42 UTC (permalink / raw) To: Jesper Juhl Cc: Arjan van de Ven, Linus Torvalds, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann On Wed, 2006-12-20 at 12:39 +0100, Jesper Juhl wrote: > On 20/12/06, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > > > > fix page_mkclean_one() > > > > it had several issues: > > - it failed to flush the cache > > - it failed to flush the tlb > > - it failed to do s390 (s390 guys, please verify this is now correct) > > > > Also, clear in a loop to ensure SMP safeness as suggested by Arjan. > > > > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > > --- > > mm/rmap.c | 29 +++++++++++++++-------------- > > 1 file changed, 15 insertions(+), 14 deletions(-) > > > > Index: linux-2.6/mm/rmap.c > > =================================================================== > > --- linux-2.6.orig/mm/rmap.c > > +++ linux-2.6/mm/rmap.c > > @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page > > { > > struct mm_struct *mm = vma->vm_mm; > > unsigned long address; > > - pte_t *pte, entry; > > + pte_t *ptep; > > spinlock_t *ptl; > > int ret = 0; > > > > @@ -440,22 +440,23 @@ static int page_mkclean_one(struct page > > if (address == -EFAULT) > > goto out; > > > > - pte = page_check_address(page, mm, address, &ptl); > > - if (!pte) > > + ptep = page_check_address(page, mm, address, &ptl); > > + if (!ptep) > > goto out; > > > > - if (!pte_dirty(*pte) && !pte_write(*pte)) > > - goto unlock; > > - > > - entry = ptep_get_and_clear(mm, address, pte); > > - entry = pte_mkclean(entry); > > - entry = pte_wrprotect(entry); > > - ptep_establish(vma, address, pte, entry); > > - 
lazy_mmu_prot_update(entry); > > - ret = 1; > > + while (pte_dirty(*ptep) || pte_write(*ptep)) { > > + pte_t entry = ptep_get_and_clear(mm, address, ptep); > > + flush_cache_page(vma, address, pte_pfn(entry)); > > + flush_tlb_page(vma, address); > > + (void)page_test_and_clear_dirty(page); /* do the s390 thing */ > > + entry = pte_wrprotect(entry); > > + entry = pte_mkclean(entry); > > + set_pte_at(vma, address, ptep, entry); > > + lazy_mmu_prot_update(entry); > > + ret = 1; > > + } > > > Having the assignment of "ret = 1;" inside the loop seems a little > pointless. Perhaps gcc can optimize it, but still, that assignment > really only needs to happen once outside the loop. Sure, but I was hoping gcc was smart enough. Placing it outside the loop would require an extra if stmt. Also the chance this loop will actually be traversed more than once is _very_ small. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 11:42 ` Peter Zijlstra @ 2006-12-20 12:12 ` Jesper Juhl 0 siblings, 0 replies; 311+ messages in thread From: Jesper Juhl @ 2006-12-20 12:12 UTC (permalink / raw) To: Peter Zijlstra Cc: Arjan van de Ven, Linus Torvalds, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann On 20/12/06, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > On Wed, 2006-12-20 at 12:39 +0100, Jesper Juhl wrote: > > Having the assignment of "ret = 1;" inside the loop seems a little > > pointless. Perhaps gcc can optimize it, but still, that assignment > > really only needs to happen once outside the loop. > > Sure, but I was hoping gcc was smart enough. Placing it outside the loop > would require an extra if stmt. Also the chance this loop will actually > be traversed more than once is _very_ small. > All right - I just spotted it and thought I'd point it out :-) -- Jesper Juhl <jesper.juhl@gmail.com> Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html Plain text mails only, please http://www.expita.com/nomime.html ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 11:26 ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Peter Zijlstra 2006-12-20 11:39 ` Jesper Juhl @ 2006-12-20 13:00 ` Hugh Dickins 2006-12-20 13:56 ` Peter Zijlstra 2006-12-20 14:55 ` Martin Schwidefsky 2 siblings, 1 reply; 311+ messages in thread From: Hugh Dickins @ 2006-12-20 13:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Arjan van de Ven, Linus Torvalds, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Michlmayr, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann On Wed, 20 Dec 2006, Peter Zijlstra wrote: > > fix page_mkclean_one() Congratulations on getting to the bottom of it, Peter (if you have: I haven't digested enough of the thread to tell). I'm mostly offline at present, no time for dialogue, I'll throw out a few remarks and run... > > it had several issues: > - it failed to flush the cache It's unclear to me why it should need to flush the cache, but I don't know much about that, and mprotect does flush the cache in advance - I think others will tell you that if it does need to be flushed, it must be flushed while there's still a valid pte (on some arches at least). > - it failed to flush the tlb Eh? It flushed the TLB inside ptep_establish, didn't it? I guess you mean you've found a race before it flushed the TLB. > - it failed to do s390 (s390 guys, please verify this is now correct) Hmm, I thought we cleared it with them back at the time. > > Also, clear in a loop to ensure SMP safeness as suggested by Arjan. Yikes. Well, please compare with mprotect's change_pte_range. I think I took that as the relevant standard when checking your implementation, and back then satisfied myself that what you were doing was equivalent. If page_mkclean_one is now agreed to be significantly defective, then I suspect change_pte_range is also; perhaps others too. 
(But I haven't found time to do more than skim through the thread, I've not thought through the issues at all: I am surprised that it's now found defective, we looked at it long and hard back then.) And trivial point: please undo those distracting "pte" to "ptep" mods: if you want to call pte pointers ptep, throughout rmap.c and throughout mm, that's another patch entirely (which I won't welcome, but others may). Hugh ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 13:00 ` Hugh Dickins @ 2006-12-20 13:56 ` Peter Zijlstra 2006-12-20 17:03 ` Martin Michlmayr 0 siblings, 1 reply; 311+ messages in thread From: Peter Zijlstra @ 2006-12-20 13:56 UTC (permalink / raw) To: Hugh Dickins Cc: Arjan van de Ven, Linus Torvalds, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Michlmayr, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann On Wed, 2006-12-20 at 13:00 +0000, Hugh Dickins wrote: > On Wed, 20 Dec 2006, Peter Zijlstra wrote: > > > > fix page_mkclean_one() > > Congratulations on getting to the bottom of it, Peter (if you have: > I haven't digested enough of the thread to tell). Well, I thought I understood, you just shattered that. > I'm mostly offline at > present, no time for dialogue, I'll throw out a few remarks and run... I wondered where you were ;-) Enjoy your time away from the computer. > > > > it had several issues: > > - it failed to flush the cache > > It's unclear to me why it should need to flush the cache, but I don't > know much about that, and mprotect does flush the cache in advance - > I think others will tell you that if it does need to be flushed, I was still thinking about why exactly, but indeed since mprotect does I thought it prudent to also do it. > it must > be flushed while there's still a valid pte (on some arches at least). Ah, good point, makes sense I guess. > > - it failed to flush the tlb > > Eh? It flushed the TLB inside ptep_establish, didn't it? > I guess you mean you've found a race before it flushed the TLB. Hmm, quite right indeed. I missed that. So moving the flush inside the pte cleared section closed a race. It seems I must have a long hard look at these architecture manuals... > > - it failed to do s390 (s390 guys, please verify this is now correct) > > Hmm, I thought we cleared it with them back at the time. /me queries mail folder... 
can't seem to find it. > > > > Also, clear in a loop to ensure SMP safeness as suggested by Arjan. > > Yikes. Well, please compare with mprotect's change_pte_range. I think > I took that as the relevant standard when checking your implementation, > and back then satisfied myself that what you were doing was equivalent. > If page_mkclean_one is now agreed to be significantly defective, then > I suspect change_pte_range is also; perhaps others too. Arjan argued that mprotect and msync would mostly race with themselves in userspace. > (But I haven't found time to do more than skim through the thread, > I've not thought through the issues at all: I am surprised that it's > now found defective, we looked at it long and hard back then.) --- page_mkclean_one() fix it had several issues: - it failed to flush the cache - a race wrt tlb flushing - it failed to do s390 (s390 guys, please verify this is now correct) Also, clear in a loop to ensure SMP safeness as suggested by Arjan. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- mm/rmap.c | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page { struct mm_struct *mm = vma->vm_mm; unsigned long address; - pte_t *pte, entry; + pte_t *pte; spinlock_t *ptl; int ret = 0; @@ -444,17 +444,20 @@ static int page_mkclean_one(struct page if (!pte) goto out; - if (!pte_dirty(*pte) && !pte_write(*pte)) - goto unlock; + while (pte_dirty(*pte) || pte_write(*pte)) { + pte_t entry; - entry = ptep_get_and_clear(mm, address, pte); - entry = pte_mkclean(entry); - entry = pte_wrprotect(entry); - ptep_establish(vma, address, pte, entry); - lazy_mmu_prot_update(entry); - ret = 1; + flush_cache_page(vma, address, pte_pfn(*pte)); + entry = ptep_get_and_clear(mm, address, pte); + flush_tlb_page(vma, address); + 
(void)page_test_and_clear_dirty(page); /* do the s390 thing */ + entry = pte_wrprotect(entry); + entry = pte_mkclean(entry); + set_pte_at(vma, address, pte, entry); + lazy_mmu_prot_update(entry); + ret = 1; + } -unlock: pte_unmap_unlock(pte, ptl); out: return ret; ^ permalink raw reply [flat|nested] 311+ messages in thread
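The clear-in-a-loop pattern in the patch above can be modeled in userspace. The following is only a sketch under stated assumptions: the `toy_` names and the bit layout are hypothetical and not the kernel's, C11 atomics stand in for `ptep_get_and_clear()` and `set_pte_at()`, and the cache and TLB flushes are reduced to a comment. It illustrates why the dirty/write test is repeated after the entry has been put back, not how any architecture implements it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical bit layout for a toy "pte"; not the kernel's layout. */
#define TOY_PTE_PRESENT 0x1u
#define TOY_PTE_WRITE   0x2u
#define TOY_PTE_DIRTY   0x4u

typedef _Atomic unsigned int toy_pte_t;

/*
 * Write-protect and clean a toy pte.
 * Returns 1 if the entry was dirty or writable on entry, 0 otherwise.
 */
int toy_mkclean_one(toy_pte_t *ptep)
{
    int ret = 0;

    assert(ptep != NULL);

    /* Re-test after each pass in case the bits were set again meanwhile. */
    while (atomic_load(ptep) & (TOY_PTE_DIRTY | TOY_PTE_WRITE)) {
        /* Atomically take the entry out, leaving an empty slot behind. */
        unsigned int entry = atomic_exchange(ptep, 0u);

        /* In the kernel, the cache and TLB flushes would happen around here. */

        entry &= ~(TOY_PTE_WRITE | TOY_PTE_DIRTY);  /* wrprotect + mkclean */
        atomic_store(ptep, entry);                  /* re-install the entry */
        ret = 1;
    }
    return ret;
}
```

As in the thread, `ret = 1` sits inside the loop; the loop body is expected to run at most once in nearly all cases.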
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 13:56 ` Peter Zijlstra @ 2006-12-20 17:03 ` Martin Michlmayr 2006-12-20 17:35 ` Linus Torvalds 2006-12-20 22:11 ` Russell King 0 siblings, 2 replies; 311+ messages in thread From: Martin Michlmayr @ 2006-12-20 17:03 UTC (permalink / raw) To: Peter Zijlstra Cc: Hugh Dickins, Arjan van de Ven, Linus Torvalds, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson * Peter Zijlstra <a.p.zijlstra@chello.nl> [2006-12-20 14:56]: > page_mkclean_one() fix This patch doesn't fix my problem (apt segfaults on ARM because its database is corrupted). -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 17:03 ` Martin Michlmayr @ 2006-12-20 17:35 ` Linus Torvalds 2006-12-20 17:53 ` Martin Michlmayr 2006-12-20 22:11 ` Russell King 1 sibling, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-20 17:35 UTC (permalink / raw) To: Martin Michlmayr Cc: Peter Zijlstra, Hugh Dickins, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Wed, 20 Dec 2006, Martin Michlmayr wrote: > * Peter Zijlstra <a.p.zijlstra@chello.nl> [2006-12-20 14:56]: > > page_mkclean_one() fix > > This patch doesn't fix my problem (apt segfaults on ARM because its > database is corrupted). Can you remind us: - your ARM is UP, right? Do you have PREEMPT on? - This is probably a stupid question, but you did make sure that the database was ok (with some rebuild command) and that you didn't have preexisting corruption? Anyway, the page_mkclean_one() fixes (along with _most_ things we've looked at) shouldn't matter on UP, at least certainly not without PREEMPT. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 17:35 ` Linus Torvalds @ 2006-12-20 17:53 ` Martin Michlmayr 2006-12-20 19:01 ` Linus Torvalds 0 siblings, 1 reply; 311+ messages in thread From: Martin Michlmayr @ 2006-12-20 17:53 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Hugh Dickins, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson * Linus Torvalds <torvalds@osdl.org> [2006-12-20 09:35]: > Can you remind us: > - your ARM is UP, right? Do you have PREEMPT on? It's UP and PREEMPT is not set. I used 2.6.19 plus the patch that has been posted. > - This is probably a stupid question, but you did make sure that the > database was ok (with some rebuild command) and that you didn't have > preexisting corruption? Yes, my test case is to install Debian on the ARM machine so the database is created fresh. While the corruption always triggers during a fresh installation, it's much harder to see in a running system. Some people see it on their system but I haven't found a 100% working recipe to reproduce it yet given a working system; doing a new installation seems to trigger it all the time though. > Anyway, the page_mkclean_one() fixes (along with _most_ things we've > looked at) shouldn't matter on UP, at least certainly not without > PREEMPT. Hmm. So what about UP without PREEMPT then... Maybe the following information is helpful in some way: remember how I said that we have applied 6 mm patches to 2.6.18 in Debian? According to Gordon Farquharson, who's helping me a great deal with testing installation on this ARM machine (Linksys NSLU2), the corruption doesn't always show up when you only apply mm-tracking-shared-dirty-pages.patch to 2.6.18 but it shows up all the time with all six patches applied. 
As a reminder, the 6 patches we apply are: mm-tracking-shared-dirty-pages.patch mm-balance-dirty-pages.patch mm-optimize-mprotect.patch mm-install_page-cleanup.patch mm-do_wp_page-fixup.patch mm-msync-cleanup.patch -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 17:53 ` Martin Michlmayr @ 2006-12-20 19:01 ` Linus Torvalds 2006-12-20 19:50 ` Linus Torvalds 0 siblings, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-20 19:01 UTC (permalink / raw) To: Martin Michlmayr Cc: Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Wed, 20 Dec 2006, Martin Michlmayr wrote: > > > Anyway, the page_mkclean_one() fixes (along with _most_ things we've > > looked at) shouldn't matter on UP, at least certainly not without > > PREEMPT. > > Hmm. So what about UP without PREEMPT then... So that's why I've been harping on the fact that I think we simply do really wrong things with PG_dirty at times, and that I find it confusing that there's - clear_page_dirty_for_io(): this one makes sense. The name makes sense, and the implementation makes sense (which is _not_ the same thing as "works", of course - "makes sense" does not mean "no bugs" ;). - test_clear_page_dirty: this one makes no sense WHATSOEVER, except as a buggy way to do the "_for_io()" case. This makes sense neither from a concept angle _nor_ an implementation angle (the whole "test_" part is nonsense: why would anybody care? What operation does this? What can it do if the page is dirty? It also has no sensible thing it can do to the page tables). - clear_page_dirty(): this one makes sense only as a "cancel" operation, for vmtruncate and friends (it's different from the "_for_io()" case in several ways: (a) we should have unmapped such pages forcibly _anyway_, so looking at the PTEs makes no sense. (b) because we're not starting IO, we don't have the "mark for writeback" case, and we need to clear the dirty tags from the radix trees etc since the writeback logic won't do it for us). 
The _implementation_ of "clear_page_dirty()" doesn't make sense, but the concept does. I've repeated that theory a few times, but neither Andrew nor Nick seem to really believe in it. So I'll just repeat it once more, only to be shot down. I think we have three operations, one of which is totally idiotic and senseless, and one of which is just badly implemented. > Maybe the following information is helpful in some way: remember how I > said that we have applied 6 mm patches to 2.6.18 in Debian? According > to Gordon Farquharson, who's helping me a great deal with testing > installation on this ARM machine (Linksys NSLU2), the corruption > doesn't always show up when you only apply > mm-tracking-shared-dirty-pages.patch to 2.6.18 but it shows up all the > time with all six patches applied. I think the "it happens occasionally with just the first patch" is the really important part. The other patches really are likely to just change writeback timing behaviour (_especially_ the "tracking-shared-dirty-pages" patch), but if it happens occasionally even with the first one, that's the one that almost certainly introduced the real problem. And my argument above is actually that the "real problem" goes a hell of a lot further back in time, but it didn't use to be a problem because we just considered dirty bits in the page tables to be something _completely_ independent of the "page dirty" status, so historically, it just didn't matter that we had insane implementations and senseless operations. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 19:01 ` Linus Torvalds @ 2006-12-20 19:50 ` Linus Torvalds 2006-12-20 20:22 ` Peter Zijlstra ` (6 more replies) 0 siblings, 7 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-20 19:50 UTC (permalink / raw) To: Martin Michlmayr Cc: Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Wed, 20 Dec 2006, Linus Torvalds wrote: > > So that's why I've been harping on the fact that I think we simply do > really wrong things with PG_dirty at times [ ... ] Ok, I'll just put my money where my mouth is, and suggest a patch like THIS instead. This one clears up all the issues I find irritating: - "test_clear_page_dirty()" is insane, both conceptually and as an implementation. "Give me a 'C', give me an 'R', give me an 'A', give me a 'P'". So rip out that mindfart entirely. - "clear_page_dirty()" is badly named, and should be about CANCELLING the dirty bit, and must never be called with pages mapped anyway. So throw that out too, and replace it with a new function: void cancel_dirty_page(struct page *page, unsigned int accounting_size); - "clear_page_dirty_for_io()" is fine. And with that, I then either rip out any old users of "test_clear_page_dirty()" or "clear_page_dirty()", and if appropriate (and it's really only appropriate for "truncate()"), I replace them with the new "cancel_dirty_page()". Most of the time, they should just be deleted entirely. NOTE NOTE NOTE! I _only_ did enough to make things compile for my particular configuration. That means that right now the following filesystems are broken with this patch (because they use the totally broken old crap): CIFS, FUSE, JFS, ReiserFS, XFS and I don't know exactly what they need to be fixed. 
But most likely their usage was insane and pointless anyway (looking at the ReiserFS case, for example, that was DEFINITELY the case. I can't even imagine what the heck it thinks it is doing). Anyway, I'm not at all guaranteeing that this solves anything at all. I _do_ guarantee that this is a h*ll of a lot saner than what we had before. [ This also includes a few of my older patches, I didn't bother to sort them out, and the fs/buffer.c patch is required because it got rid of one of the insane uses of test_clear_page_dirty(). So this goes directly on top of current -git, with no other changes in the tree. ] Nick, Hugh, Peter, Andrew? Comments? Martin, Andrei, does this make any difference for your corruption cases? Linus --- diff --git a/fs/buffer.c b/fs/buffer.c index d1f1b54..263f88e 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page) int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? */ @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page) spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* - * If the filesystem writes its buffers by hand (eg ext3) - * then we can have clean buffers against a dirty page. We - * clean the page here; otherwise later reattachment of buffers - * could encounter a non-uptodate page, which is unresolvable. - * This only applies in the rare case where try_to_free_buffers - * succeeds but the page is not freed. - * - * Also, during truncate, discard_buffer will have marked all - * the page's buffers clean. We discover that here and clean - * the page also. 
- */ - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index ed2c223..4f4cd13 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct file *file, static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + cancel_dirty_page(page, /* No IO accounting for huge pages? */0); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 4830a3b..350878a 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -253,15 +253,11 @@ static inline void SetPageUptodate(struct page *page) struct page; /* forward declaration */ -int test_clear_page_dirty(struct page *page); +extern void cancel_dirty_page(struct page *page, unsigned int account_size); + int test_clear_page_writeback(struct page *page); int test_set_page_writeback(struct page *page); -static inline void clear_page_dirty(struct page *page) -{ - test_clear_page_dirty(page); -} - static inline void set_page_writeback(struct page *page) { test_set_page_writeback(page); diff --git a/mm/memory.c b/mm/memory.c index c00bac6..79cecab 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping, } EXPORT_SYMBOL(unmap_mapping_range); +static void check_last_page(struct address_space *mapping, loff_t size) +{ + pgoff_t index; + unsigned int offset; + struct page *page; + + if (!mapping) + return; + offset = size & ~PAGE_MASK; + if (!offset) + return; + index = size >> PAGE_SHIFT; + page = find_lock_page(mapping, index); + if (page) { + unsigned int check = 0; + unsigned char *kaddr = kmap_atomic(page, KM_USER0); + do { + check += kaddr[offset++]; + } while (offset < PAGE_SIZE); + kunmap_atomic(kaddr,KM_USER0); + 
unlock_page(page); + page_cache_release(page); + if (check) + printk("%s: BADNESS: truncate check %u\n", current->comm, check); + } +} + /** * vmtruncate - unmap mappings "freed" by truncate() syscall * @inode: inode of the file used @@ -1875,6 +1902,7 @@ do_expand: goto out_sig; if (offset > inode->i_sb->s_maxbytes) goto out_big; + check_last_page(mapping, inode->i_size); i_size_write(inode, offset); out_truncate: diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 237107c..b3a198c 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -845,38 +845,6 @@ int set_page_dirty_lock(struct page *page) EXPORT_SYMBOL(set_page_dirty_lock); /* - * Clear a page's dirty flag, while caring for dirty memory accounting. - * Returns true if the page was previously dirty. - */ -int test_clear_page_dirty(struct page *page) -{ - struct address_space *mapping = page_mapping(page); - unsigned long flags; - - if (!mapping) - return TestClearPageDirty(page); - - write_lock_irqsave(&mapping->tree_lock, flags); - if (TestClearPageDirty(page)) { - radix_tree_tag_clear(&mapping->page_tree, - page_index(page), PAGECACHE_TAG_DIRTY); - write_unlock_irqrestore(&mapping->tree_lock, flags); - /* - * We can continue to use `mapping' here because the - * page is locked, which pins the address_space - */ - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); - dec_zone_page_state(page, NR_FILE_DIRTY); - } - return 1; - } - write_unlock_irqrestore(&mapping->tree_lock, flags); - return 0; -} -EXPORT_SYMBOL(test_clear_page_dirty); - -/* * Clear a page's dirty flag, while caring for dirty memory accounting. * Returns true if the page was previously dirty. 
* diff --git a/mm/truncate.c b/mm/truncate.c index 9bfb8e8..bf9e296 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -51,6 +51,20 @@ static inline void truncate_partial_page(struct page *page, unsigned partial) do_invalidatepage(page, partial); } +void cancel_dirty_page(struct page *page, unsigned int account_size) +{ + /* If we're cancelling the page, it had better not be mapped any more */ + if (page_mapped(page)) { + static unsigned int warncount; + + WARN_ON(++warncount < 5); + } + + if (TestClearPageDirty(page) && account_size) + task_io_account_cancelled_write(account_size); +} + + /* * If truncate cannot remove the fs-private metadata from the page, the page * becomes anonymous. It will be left on the LRU and may even be mapped into @@ -70,8 +84,8 @@ truncate_complete_page(struct address_space *mapping, struct page *page) if (PagePrivate(page)) do_invalidatepage(page, 0); - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); + cancel_dirty_page(page, PAGE_CACHE_SIZE); + ClearPageUptodate(page); ClearPageMappedToDisk(page); remove_from_page_cache(page); @@ -350,7 +364,6 @@ int invalidate_inode_pages2_range(struct address_space *mapping, for (i = 0; !ret && i < pagevec_count(&pvec); i++) { struct page *page = pvec.pages[i]; pgoff_t page_index; - int was_dirty; lock_page(page); if (page->mapping != mapping) { @@ -386,12 +399,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping, PAGE_CACHE_SIZE, 0); } } - was_dirty = test_clear_page_dirty(page); - if (!invalidate_complete_page2(mapping, page)) { - if (was_dirty) - set_page_dirty(page); + if (!invalidate_complete_page2(mapping, page)) ret = -EIO; - } unlock_page(page); } pagevec_release(&pvec); ^ permalink raw reply related [flat|nested] 311+ messages in thread
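The accounting behaviour of the proposed `cancel_dirty_page()` can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct toy_page`, `struct toy_io_stats` and `toy_cancel_dirty_page()` are hypothetical stand-ins, a plain counter replaces `task_io_account_cancelled_write()`, and the rate-limited `WARN_ON()` for mapped pages is reduced to a warning counter. It mirrors only the logic visible in the patch: the dirty flag is always cleared, and bytes are accounted only if the page was dirty and `account_size` is non-zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few page fields this model needs. */
struct toy_page {
    bool dirty;             /* models PG_dirty */
    unsigned int mapcount;  /* models "is the page still mapped?" */
};

/* Hypothetical stand-in for per-task IO accounting. */
struct toy_io_stats {
    unsigned long cancelled_write_bytes;
    unsigned int mapped_warnings;
};

/*
 * Cancel a pending dirty state without starting IO.
 * account_size is used for statistics only; 0 means "do not account".
 */
void toy_cancel_dirty_page(struct toy_page *page, unsigned int account_size,
                           struct toy_io_stats *stats)
{
    assert(page != NULL && stats != NULL);

    /* A page being cancelled had better not be mapped any more. */
    if (page->mapcount != 0)
        stats->mapped_warnings++;

    /* Test-and-clear the dirty flag, then account if requested. */
    if (page->dirty) {
        page->dirty = false;
        if (account_size)
            stats->cancelled_write_bytes += account_size;
    }
}
```

Calling it twice on the same page accounts the cancelled write only once, since the second call finds the flag already clear.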
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 19:50 ` Linus Torvalds @ 2006-12-20 20:22 ` Peter Zijlstra 2006-12-20 21:55 ` Dave Kleikamp ` (5 subsequent siblings) 6 siblings, 0 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-20 20:22 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Wed, 2006-12-20 at 11:50 -0800, Linus Torvalds wrote: > Nick, Hugh, Peter, Andrew? Comments? Hooray! I'm all for this cleanup. Let us see where this road leads.. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 19:50 ` Linus Torvalds 2006-12-20 20:22 ` Peter Zijlstra @ 2006-12-20 21:55 ` Dave Kleikamp 2006-12-20 22:25 ` Linus Torvalds 2006-12-20 22:15 ` Peter Zijlstra ` (4 subsequent siblings) 6 siblings, 1 reply; 311+ messages in thread From: Dave Kleikamp @ 2006-12-20 21:55 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Wed, 2006-12-20 at 11:50 -0800, Linus Torvalds wrote: > NOTE NOTE NOTE! I _only_ did enough to make things compile for my > particular configuration. That means that right now the following > filesystems are broken with this patch (because they use the totally > broken old crap): > > CIFS, FUSE, JFS, ReiserFS, XFS > > and I don't know exactly what they need to be fixed. But most likely their > usage was insane and pointless anyway (looking at the ReiserFS case, for > example, that was DEFINITELY the case. I can't even imagine what the heck > it thinks it is doing). Here's a patch to get rid of clear_page_dirty() from jfs. I'm not convinced it was totally broken, but I'm not convinced it wasn't. Either way, I don't think that bit of code was particularly beneficial. Feel free to apply this patch independent of your patch if you really think that jfs's use of clear_page_dirty is crap, or I can push it through -mm first. This patch removes some questionable code that attempted to make a no-longer-used page easier to reclaim. Calling metapage_writepage against such a page will not result in any I/O being performed, so removing this code shouldn't be a big deal. 
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> diff -Nurp linux-orig/fs/jfs/jfs_metapage.c linux/fs/jfs/jfs_metapage.c --- linux-orig/fs/jfs/jfs_metapage.c 2006-12-07 17:12:58.000000000 -0600 +++ linux/fs/jfs/jfs_metapage.c 2006-12-20 15:19:48.000000000 -0600 @@ -764,22 +764,9 @@ void release_metapage(struct metapage * } else if (mp->lsn) /* discard_metapage doesn't remove it */ remove_from_logsync(mp); -#if MPS_PER_PAGE == 1 - /* - * If we know this is the only thing in the page, we can throw - * the page out of the page cache. If pages are larger, we - * don't want to do this. - */ - - /* Retest mp->count since we may have released page lock */ - if (test_bit(META_discard, &mp->flag) && !mp->count) { - clear_page_dirty(page); - ClearPageUptodate(page); - } -#else /* Try to keep metapages from using up too much memory */ drop_metapage(page, mp); -#endif + unlock_page(page); page_cache_release(page); } ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 21:55 ` Dave Kleikamp @ 2006-12-20 22:25 ` Linus Torvalds 2006-12-20 22:59 ` Dave Kleikamp 0 siblings, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-20 22:25 UTC (permalink / raw) To: Dave Kleikamp Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Wed, 20 Dec 2006, Dave Kleikamp wrote: > > This patch removes some questionable code that attempted to make a > no-longer-used page easier to reclaim. If so, "cancel_dirty_page()" may actually be the right thing to use, but only if you can guarantee that the page isn't mapped anywhere (and from the name of the function I guess it's not something that you'll ever map?) So the JFS code _looks_ like you could just replace the clear_page_dirty(page); with cancel_dirty_page(page, PAGE_CACHE_SIZE); (where that second parameter is just used for statistics - it updates the "cancelled IO" byte-counts if CONFIG_TASK_IO_ACCOUNTING is set - so the number doesn't really matter, you could make it zero if you never want the thing to show up in the IO accounting). Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 22:25 ` Linus Torvalds @ 2006-12-20 22:59 ` Dave Kleikamp 0 siblings, 0 replies; 311+ messages in thread From: Dave Kleikamp @ 2006-12-20 22:59 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Wed, 2006-12-20 at 14:25 -0800, Linus Torvalds wrote: > > On Wed, 20 Dec 2006, Dave Kleikamp wrote: > > > > This patch removes some questionable code that attempted to make a > > no-longer-used page easier to reclaim. > > If so, "cancel_dirty_page()" may actually be the right thing to use, but > only if you can guarantee that the page isn't mapped anywhere (and from > the name of the function I guess it's not something that you'll ever map?) That's correct. It can't be mapped. It's a private mapping only used for metadata. I'm really not sure the code in question is having the intended effect. Maybe one of the gurus on cc: can take a look at the code and tell me if it's worth keeping. I apologize in advance if it makes anyone lose their lunch. > So the JFS code _looks_ like you could just replace the > > clear_page_dirty(page); > > with > > cancel_dirty_page(page, PAGE_CACHE_SIZE); > > (where that second parameter is just used for statistics - it updates the > "cancelled IO" byte-counts if CONFIG_TASK_IO_ACCOUNTING is set - so the > number doesn't really matter, you could make it zero if you never want the > thing to show up in the IO accounting). I'm not sure whether zero or PAGE_CACHE_SIZE would be better. The situation is where some page of metadata is no longer used, say shrinking a directory tree or truncating a file and throwing out the extent tree. 
Thanks, Shaggy -- David Kleikamp IBM Linux Technology Center ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 19:50 ` Linus Torvalds 2006-12-20 20:22 ` Peter Zijlstra 2006-12-20 21:55 ` Dave Kleikamp @ 2006-12-20 22:15 ` Peter Zijlstra 2006-12-20 22:20 ` Peter Zijlstra ` (2 more replies) 2006-12-20 23:24 ` David Chinner ` (3 subsequent siblings) 6 siblings, 3 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-20 22:15 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson I think this is also needed:

---
 mm/truncate.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -320,19 +320,14 @@ invalidate_complete_page2(struct address
 	if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
 		return 0;
 
+	cancel_dirty_page(page, PAGE_CACHE_SIZE);
 	lock_page_ref_irq(page);
-	if (PageDirty(page))
-		goto failed;
-
 	BUG_ON(PagePrivate(page));
 	__remove_from_page_cache(page);
 	unlock_page_ref_irq(page);
 	ClearPageUptodate(page);
 	page_cache_release(page);	/* pagecache ref */
 	return 1;
-failed:
-	unlock_page_ref_irq(page);
-	return 0;
 }
 
 /**

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 22:15 ` Peter Zijlstra @ 2006-12-20 22:20 ` Peter Zijlstra 2006-12-20 22:49 ` Linus Torvalds 2006-12-21 2:36 ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Trond Myklebust 2 siblings, 0 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-20 22:20 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Wed, 2006-12-20 at 23:15 +0100, Peter Zijlstra wrote: > I think this is also needed: See also: http://marc.theaimsgroup.com/?l=linux-kernel&m=116603599904278&w=2 > --- > mm/truncate.c | 7 +------ > 1 file changed, 1 insertion(+), 6 deletions(-) > > Index: linux-2.6/mm/truncate.c > =================================================================== > --- linux-2.6.orig/mm/truncate.c > +++ linux-2.6/mm/truncate.c > @@ -320,19 +320,14 @@ invalidate_complete_page2(struct address > if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL)) > return 0; > > + cancel_dirty_page(page, PAGE_CACHE_SIZE); > lock_page_ref_irq(page); > - if (PageDirty(page)) > - goto failed; > - > BUG_ON(PagePrivate(page)); > __remove_from_page_cache(page); > unlock_page_ref_irq(page); > ClearPageUptodate(page); > page_cache_release(page); /* pagecache ref */ > return 1; > -failed: > - unlock_page_ref_irq(page); > - return 0; > } > > /** ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 22:15 ` Peter Zijlstra 2006-12-20 22:20 ` Peter Zijlstra @ 2006-12-20 22:49 ` Linus Torvalds 2006-12-20 23:03 ` Peter Zijlstra 2006-12-21 2:36 ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Trond Myklebust 2 siblings, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-20 22:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Wed, 20 Dec 2006, Peter Zijlstra wrote: > > I think this is also needed: Yeah, that looks about right. Although I think it should go above the "try_to_release_page()", because right now we do that "ttrp()" with the dirty bit set, and we should let the low-level filesystem just know that it's simply not interesting any more (and, indeed, "try_to_free_buffers()" too, for that matter). Anyway, I think that's a detail. I'd rather know whether this all actually makes any difference what-so-ever to the corruption behaviour of Andrei &co. Maybe the UP ARM case is some strange dcache alias issue with PIO IDE, and the only reason that started showing up at the same time is the different IO loads. Who knows. [ Although I think you may have been on the right track with that dcache flushing stuff in "page_mkclean()".. It might not have been quite all there, but I think we should go back and look very closely at page_mkclean() regardless of any other issues! ] So far, my whole "cancel_dirty_page/clean_page_dirty_for_io" patch has really been just a "try to make the code _look_ sane. I don't think we have a single report that the patch actually makes any difference yet. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 22:49 ` Linus Torvalds @ 2006-12-20 23:03 ` Peter Zijlstra 2006-12-21 9:16 ` Martin Schwidefsky 0 siblings, 1 reply; 311+ messages in thread From: Peter Zijlstra @ 2006-12-20 23:03 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Wed, 2006-12-20 at 14:49 -0800, Linus Torvalds wrote: > > On Wed, 20 Dec 2006, Peter Zijlstra wrote: > > > > I think this is also needed: > > Yeah, that looks about right. Although I think it should go above the > "try_to_release_page()", because right now we do that "ttrp()" with the > dirty bit set, and we should let the low-level filesystem just know that > it's simply not interesting any more (and, indeed, "try_to_free_buffers()" > too, for that matter). That makes NFS unhappy, see nfs_release_page(). > Anyway, I think that's a detail. I'd rather know whether this all actually > makes any difference what-so-ever to the corruption behaviour of Andrei > &co. Yeah, I have to tinker with my test setup to make it fail again. Maybe I have to add more seeds; that seemed to make a difference, and it was impossible to trigger with a single seed. FWIW I also added some scribble past i_size checks in nobh_writepage() and block_write_full_page(). FWIW2 I straced rtorrent for a bit and it does an awful lot of mmap calls and relatively few msync(MS_ASYNC);munmap(), and no truncate apart from creating sparse files at the beginning. > Maybe the UP ARM case is some strange dcache alias issue with PIO IDE, and > the only reason that started showing up at the same time is the different > IO loads. Who knows. > > [ Although I think you may have been on the right track with that dcache > flushing stuff in "page_mkclean()"..
It might not have been quite > all there, but I think we should go back and look very closely at > page_mkclean() regardless of any other issues! ] current version Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- mm/rmap.c | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page { struct mm_struct *mm = vma->vm_mm; unsigned long address; - pte_t *pte, entry; + pte_t *pte; spinlock_t *ptl; int ret = 0; @@ -444,17 +444,18 @@ static int page_mkclean_one(struct page if (!pte) goto out; - if (!pte_dirty(*pte) && !pte_write(*pte)) - goto unlock; + while (pte_dirty(*pte) || pte_write(*pte)) { + pte_t entry; - entry = ptep_get_and_clear(mm, address, pte); - entry = pte_mkclean(entry); - entry = pte_wrprotect(entry); - ptep_establish(vma, address, pte, entry); - lazy_mmu_prot_update(entry); - ret = 1; + flush_cache_page(vma, address, pte_pfn(*pte)); + entry = ptep_clear_flush(vma, address, pte); + entry = pte_wrprotect(entry); + entry = pte_mkclean(entry); + ptep_establish(vma, address, pte, entry); + lazy_mmu_prot_update(entry); + ret = 1; + } -unlock: pte_unmap_unlock(pte, ptl); out: return ret; @@ -489,6 +490,8 @@ int page_mkclean(struct page *page) if (mapping) ret = page_mkclean_file(mapping, page); } + if (page_test_and_clear_dirty(page)) + ret = 1; return ret; } > So far, my whole "cancel_dirty_page/clean_page_dirty_for_io" patch has > really been just a "try to make the code _look_ sane. I don't think we > have a single report that the patch actually makes any difference yet. I failed to compile a kernel with that patch (100% iowait and a bunch of processes stuck in D state), but sysrq-t was borked (only numbers no symbols) have yet to retry - I noticed you kicked the unwinder?. 
^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 23:03 ` Peter Zijlstra @ 2006-12-21 9:16 ` Martin Schwidefsky 2006-12-21 9:20 ` Peter Zijlstra 0 siblings, 1 reply; 311+ messages in thread From: Martin Schwidefsky @ 2006-12-21 9:16 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Thu, 2006-12-21 at 00:03 +0100, Peter Zijlstra wrote: > current version Nitpicking .. > @@ -444,17 +444,18 @@ static int page_mkclean_one(struct page > if (!pte) > goto out; > > - if (!pte_dirty(*pte) && !pte_write(*pte)) > - goto unlock; > + while (pte_dirty(*pte) || pte_write(*pte)) { > + pte_t entry; > > - entry = ptep_get_and_clear(mm, address, pte); > - entry = pte_mkclean(entry); > - entry = pte_wrprotect(entry); > - ptep_establish(vma, address, pte, entry); > - lazy_mmu_prot_update(entry); > - ret = 1; > + flush_cache_page(vma, address, pte_pfn(*pte)); > + entry = ptep_clear_flush(vma, address, pte); > + entry = pte_wrprotect(entry); > + entry = pte_mkclean(entry); > + ptep_establish(vma, address, pte, entry); Now you are flushing the tlb twice. ptep_clear_flush clears the pte and flushes the tlb, ptep_establish sets the new pte and flushes the tlb. Not good. Use set_pte_at instead of the ptep_establish. > + lazy_mmu_prot_update(entry); > + ret = 1; > + } > > -unlock: > pte_unmap_unlock(pte, ptl); > out: > return ret; -- blue skies, Martin. Martin Schwidefsky Linux for zSeries Development & Services IBM Deutschland Entwicklung GmbH "Reality continues to ruin my life." - Calvin. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 9:16 ` Martin Schwidefsky @ 2006-12-21 9:20 ` Peter Zijlstra 2006-12-21 9:26 ` Martin Schwidefsky 2006-12-21 20:01 ` Linus Torvalds 0 siblings, 2 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-21 9:20 UTC (permalink / raw) To: schwidefsky Cc: Linus Torvalds, Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Thu, 2006-12-21 at 10:16 +0100, Martin Schwidefsky wrote: > On Thu, 2006-12-21 at 00:03 +0100, Peter Zijlstra wrote: > > current version > > Nitpicking .. > > > @@ -444,17 +444,18 @@ static int page_mkclean_one(struct page > > if (!pte) > > goto out; > > > > - if (!pte_dirty(*pte) && !pte_write(*pte)) > > - goto unlock; > > + while (pte_dirty(*pte) || pte_write(*pte)) { > > + pte_t entry; > > > > - entry = ptep_get_and_clear(mm, address, pte); > > - entry = pte_mkclean(entry); > > - entry = pte_wrprotect(entry); > > - ptep_establish(vma, address, pte, entry); > > - lazy_mmu_prot_update(entry); > > - ret = 1; > > + flush_cache_page(vma, address, pte_pfn(*pte)); > > + entry = ptep_clear_flush(vma, address, pte); > > + entry = pte_wrprotect(entry); > > + entry = pte_mkclean(entry); > > + ptep_establish(vma, address, pte, entry); > > Now you are flushing the tlb twice. ptep_clear_flush clears the pte and > flushes the tlb, ptep_establish sets the new pte and flushes the tlb. > Not good. Use set_pte_at instead of the ptep_establish. Yeah, sorry, I already noticed and corrected that :-| Also, I'm dubious about the while thing and stuck a WARN_ON(ret) thing at the beginning of the loop. flush_tlb_page() does IPI the other cpus to flush their tlb too, so there should not be a SMP race, Arjan? 
> > + lazy_mmu_prot_update(entry); > > + ret = 1; > > + } > > > > -unlock: > > pte_unmap_unlock(pte, ptl); > > out: > > return ret; > ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 9:20 ` Peter Zijlstra @ 2006-12-21 9:26 ` Martin Schwidefsky 2006-12-21 20:01 ` Linus Torvalds 1 sibling, 0 replies; 311+ messages in thread From: Martin Schwidefsky @ 2006-12-21 9:26 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Thu, 2006-12-21 at 10:20 +0100, Peter Zijlstra wrote: > > Now you are flushing the tlb twice. ptep_clear_flush clears the pte and > > flushes the tlb, ptep_establish sets the new pte and flushes the tlb. > > Not good. Use set_pte_at instead of the ptep_establish. > > Yeah, sorry, I already noticed and corrected that :-| > > Also, I'm dubious about the while thing and stuck a WARN_ON(ret) thing > at the beginning of the loop. flush_tlb_page() does IPI the other cpus > to flush their tlb too, so there should not be a SMP race, Arjan? The while loop is protected by the pte lock and flush_tlb_page has to remove the tlbs on all cpus. So yes, I think the while loop is not necessary. -- blue skies, Martin. Martin Schwidefsky Linux for zSeries Development & Services IBM Deutschland Entwicklung GmbH "Reality continues to ruin my life." - Calvin. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 9:20 ` Peter Zijlstra 2006-12-21 9:26 ` Martin Schwidefsky @ 2006-12-21 20:01 ` Linus Torvalds 2006-12-28 0:00 ` Martin Schwidefsky 1 sibling, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-21 20:01 UTC (permalink / raw) To: Peter Zijlstra Cc: schwidefsky, Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Thu, 21 Dec 2006, Peter Zijlstra wrote: > > Also, I'm dubious about the while thing and stuck a WARN_ON(ret) thing > at the beginning of the loop. flush_tlb_page() does IPI the other cpus > to flush their tlb too, so there should not be a SMP race, Arjan? Now, the reason I think the loop may be needed is:

    CPU#0                           CPU#1
    -----                           -----
    load old PTE entry
    clear dirty and WP bits
                                    write to page using old PTE
                                    NOT CHECKING that the new one
                                    is write-protected, and just
                                    setting the dirty bit blindly
                                    (but atomically)
    flush_tlb_page()
                                    TLB flushed, but we now have a
                                    page that is marked dirty and
                                    unwritable in the page tables,
                                    and we will mark it clean in
                                    "struct page *"

Now, the scary thing is, IF a CPU does this, then the way we do all this, we may actually have the following sequence:

    CPU#0                           CPU#1
    -----                           -----
                                    load old PTE entry
    ptep_clear_flush():
                                    atomic "set dirty bit" sequence
                                    PTEP now contains 0000040 !!!
    flush_tlb_page();
                                    TLB flushed, but PTEP is still
                                    "dirty zero"
    write the clear/readonly PTE
    THE DIRTY BIT WAS LOST!

which might actually explain this bug. I personally _thought_ that Intel CPU's don't actually do an "set dirty bit atomically" sequence, but more of a "set dirty bit but trap if the TLB is nonpresent" thing, but I have absolutely no proof for that. Anyway, IF this is the case, then the following patch may or may not fix things. It avoids things by never overwriting a PTE entry, not even the "cleared" one.
It always does an atomic "xchg()" with a valid new entry, and looks at the old bits. What do you guys think? Does something like this work out for S/390 too? I tried to make that "ptep_flush_dirty()" concept work for architectures that hide the dirty bit somewhere else too, but.. It actually simplifies the architecture-specific code (you just need to implement a trivial "ptep_exchange()" and "ptep_flush_dirty()" macro), but I only did x86-64 and i386, and while I've booted with this, I haven't really given the thing a lot of really _deep_ thought. But I think this might be safer, as per above.. And it _might_ actually explain the problem. Exactly because the "ptep_clear() + blindly assign to ptep" might lose a dirty bit that was written by another CPU. But this really does depend on what a CPU does when it marks a page dirty. Does it just blindly write the dirty bit? Or does it actually _validate_ that the old page table entry was still present and writable? This patch makes no assumptions. It should work even if a CPU just writes the dirty bit blindly, and the only expectation is that the page tables can be accessed atomically (which had _better_ be true on any SMP architecture) Arjan, can you please check within Intel, and ask what the "proper" sequence for doing something like this is? Linus ---- commit 301d2d53ca0e5d2f61b1c1c259da410c7ee6d6a7 Author: Linus Torvalds <torvalds@woody.osdl.org> Date: Thu Dec 21 11:11:05 2006 -0800 Rewrite the page table "clear dirty and writable" accesses This is much simpler for most architectures, and allows us to do the dirty and writable clear in a single operation without any races or any double flushes. It's also much more careful: we never overwrite the old dirty bits at any time, and always make sure to do atomic memory ops to exchange and see the old value. 
Signed-off-by: Linus Torvalds <torvalds@osdl.org> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 9d774d0..8879f1d 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -61,31 +61,6 @@ do { \ }) #endif -#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY -#define ptep_test_and_clear_dirty(__vma, __address, __ptep) \ -({ \ - pte_t __pte = *__ptep; \ - int r = 1; \ - if (!pte_dirty(__pte)) \ - r = 0; \ - else \ - set_pte_at((__vma)->vm_mm, (__address), (__ptep), \ - pte_mkclean(__pte)); \ - r; \ -}) -#endif - -#ifndef __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH -#define ptep_clear_flush_dirty(__vma, __address, __ptep) \ -({ \ - int __dirty; \ - __dirty = ptep_test_and_clear_dirty(__vma, __address, __ptep); \ - if (__dirty) \ - flush_tlb_page(__vma, __address); \ - __dirty; \ -}) -#endif - #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR #define ptep_get_and_clear(__mm, __address, __ptep) \ ({ \ diff --git a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h index e6a4723..b61d6f9 100644 --- a/include/asm-i386/pgtable.h +++ b/include/asm-i386/pgtable.h @@ -300,18 +300,20 @@ do { \ flush_tlb_page(vma, address); \ } while (0) -#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH -#define ptep_clear_flush_dirty(vma, address, ptep) \ -({ \ - int __dirty; \ - __dirty = pte_dirty(*(ptep)); \ - if (__dirty) { \ - clear_bit(_PAGE_BIT_DIRTY, &(ptep)->pte_low); \ - pte_update_defer((vma)->vm_mm, (address), (ptep)); \ - flush_tlb_page(vma, address); \ - } \ - __dirty; \ -}) +/* + * "ptep_exchange()" can be used to atomically change a set of + * page table protection bits, returning the old ones (the dirty + * and accessed bits in particular, since they are set by hw). + * + * "ptep_flush_dirty()" then returns the dirty status of the + * page (on x86-64, we just look at the dirty bit in the returned + * pte, but some other architectures have the dirty bits in + * other places than the page tables). 
+ */ +#define ptep_exchange(vma, address, ptep, old, new) \ + (old).pte_low = xchg(&(ptep)->pte_low, (new).pte_low); +#define ptep_flush_dirty(vma, address, ptep, old) \ + pte_dirty(old) #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH #define ptep_clear_flush_young(vma, address, ptep) \ diff --git a/include/asm-x86_64/pgtable.h b/include/asm-x86_64/pgtable.h index 59901c6..07754b5 100644 --- a/include/asm-x86_64/pgtable.h +++ b/include/asm-x86_64/pgtable.h @@ -283,12 +283,20 @@ static inline pte_t pte_clrhuge(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & struct vm_area_struct; -static inline int ptep_test_and_clear_dirty(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep) -{ - if (!pte_dirty(*ptep)) - return 0; - return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte); -} +/* + * "ptep_exchange()" can be used to atomically change a set of + * page table protection bits, returning the old ones (the dirty + * and accessed bits in particular, since they are set by hw). + * + * "ptep_flush_dirty()" then returns the dirty status of the + * page (on x86-64, we just look at the dirty bit in the returned + * pte, but some other architectures have the dirty bits in + * other places than the page tables). 
+ */ +#define ptep_exchange(vma, address, ptep, old, new) \ + (old).pte = xchg(&(ptep)->pte, (new).pte); +#define ptep_flush_dirty(vma, address, ptep, old) \ + pte_dirty(old) static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep) { diff --git a/mm/rmap.c b/mm/rmap.c index d8a842a..a028803 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) { struct mm_struct *mm = vma->vm_mm; unsigned long address; - pte_t *pte, entry; + pte_t *ptep; spinlock_t *ptl; int ret = 0; @@ -440,22 +440,24 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl); - if (!pte) - goto out; - - if (!pte_dirty(*pte) && !pte_write(*pte)) - goto unlock; - - entry = ptep_get_and_clear(mm, address, pte); - entry = pte_mkclean(entry); - entry = pte_wrprotect(entry); - ptep_establish(vma, address, pte, entry); - lazy_mmu_prot_update(entry); - ret = 1; - -unlock: - pte_unmap_unlock(pte, ptl); + ptep = page_check_address(page, mm, address, &ptl); + if (ptep) { + pte_t old, new; + + old = *ptep; + new = pte_wrprotect(pte_mkclean(old)); + if (!pte_same(old, new)) { + for (;;) { + flush_cache_page(vma, address, page_to_pfn(page)); + ptep_exchange(vma, address, ptep, old, new); + if (pte_same(old, new)) + break; + ret |= ptep_flush_dirty(vma, address, ptep, old); + flush_tlb_page(vma, address); + } + } + pte_unmap_unlock(pte, ptl); + } out: return ret; } ^ permalink raw reply related [flat|nested] 311+ messages in thread
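The lost-dirty-bit window in the message above can be modelled in userspace. What follows is only a minimal sketch, not kernel code: the toy_* names and bit values are made up for illustration, and the "blind" dirty-bit update is the hypothetical hardware behaviour being worried about, injected by hand at a fixed point instead of by a second CPU.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy PTE bits; illustrative values, not a real page-table layout. */
#define TOY_PRESENT 0x01u
#define TOY_WRITE   0x02u
#define TOY_DIRTY   0x40u

typedef _Atomic uint32_t toy_pte_t;

/* Hypothetical hardware: sets the dirty bit atomically without checking
 * that the entry is still present and writable. */
static void toy_hw_blind_set_dirty(toy_pte_t *ptep)
{
	atomic_fetch_or(ptep, TOY_DIRTY);
}

/* Clear-then-store scheme: zero the entry, then write back a clean,
 * read-only copy of the value that was read. A racing dirty-bit update
 * that lands in between is overwritten. Returns the dirty state seen. */
static bool toy_clean_clear_then_store(toy_pte_t *ptep, bool racing_write)
{
	uint32_t old = atomic_exchange(ptep, 0u);

	if (racing_write)
		toy_hw_blind_set_dirty(ptep);	/* entry is now "dirty zero" */
	atomic_store(ptep, old & ~(TOY_DIRTY | TOY_WRITE));
	return (old & TOY_DIRTY) != 0;
}

/* Exchange-loop scheme: only ever swap in the valid new entry and keep
 * looping until the value swapped out equals it, accumulating any dirty
 * bit observed along the way. */
static bool toy_clean_exchange_loop(toy_pte_t *ptep, bool racing_write)
{
	uint32_t old = atomic_load(ptep);
	uint32_t next = old & ~(TOY_DIRTY | TOY_WRITE);
	bool dirty = false;

	if (old == next)
		return false;
	for (;;) {
		old = atomic_exchange(ptep, next);
		if (old == next)
			break;
		dirty |= (old & TOY_DIRTY) != 0;
		/* a TLB flush would happen here; the stale-TLB writer sneaks
		 * in once before it takes effect */
		if (racing_write) {
			toy_hw_blind_set_dirty(ptep);
			racing_write = false;
		}
	}
	return dirty;
}
```

Starting from a clean, writable entry with the racing write enabled, the first function reports "not dirty" and leaves a clean entry behind, while the second reports "dirty". Both end with the same read-only entry; only the second one noticed the write.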
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 20:01 ` Linus Torvalds @ 2006-12-28 0:00 ` Martin Schwidefsky 2006-12-28 0:42 ` Linus Torvalds 0 siblings, 1 reply; 311+ messages in thread From: Martin Schwidefsky @ 2006-12-28 0:00 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Thu, 2006-12-21 at 12:01 -0800, Linus Torvalds wrote: > What do you guys think? Does something like this work out for S/390 too? I > tried to make that "ptep_flush_dirty()" concept work for architectures > that hide the dirty bit somewhere else too, but.. For s390 there are two aspects to consider: 1) the pte values are 100% software controlled. They only change because a cpu stored a value to them or issued one of the specialized instructions (csp, ipte and idte). The ptep_flush_dirty would be a nop for s390. 2) ptep_exchange is a bit dangerous. For s390 we need a lock that protects the software controlled updates of the ptes. The reason is the ipte instruction. It is implemented by the machine microcode in a non-atomic way in regard to the memory. It reads the byte of the pte that contains the invalid bit, flushes the tlb entries for it and then writes back the byte with the invalid bit set. The microcode makes sure that this pte cannot be used to form a new tlb entry on any cpu while the ipte is in progress. That means compare-and-swap semantics on ptes won't work together with the ipte optimization. As long as there is the pte lock that protects all software accesses to the pte we are fine. But if any code expects that ptep_exchange does something like an xchg, things break. -- blue skies, Martin. Martin Schwidefsky Linux for zSeries Development & Services IBM Deutschland Entwicklung GmbH "Reality continues to ruin my life." - Calvin.
^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-28 0:00 ` Martin Schwidefsky @ 2006-12-28 0:42 ` Linus Torvalds 2006-12-28 0:52 ` [PATCH] mm: fix page_mkclean_one David Miller 0 siblings, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-28 0:42 UTC (permalink / raw) To: Martin Schwidefsky Cc: Peter Zijlstra, Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Thu, 28 Dec 2006, Martin Schwidefsky wrote: > > For s390 there are two aspects to consider: > 1) the pte values are 100% software controlled. That's fine. In that situation, you shouldn't need any atomic ops at all, I think all our sw page-table operations are already done under the pte lock. The reason x86 needs to be careful is exactly the fact that the hardware will obviously do a lot on its own, and the hardware is _not_ going to honor our page table locking ;) In an all-sw situation, a lot of this should be easier. S390 has _other_ things that are inconvenient (the strange "dirty bit is not in the page tables" thing that makes it look different from everybody else), but hey, it's a balance.. So for s390, ptep_exchange() in my example should be able to be a simple "load old value and store new one", assuming everybody honors the pte lock (and they _should_). Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 0:42 ` Linus Torvalds @ 2006-12-28 0:52 ` David Miller 0 siblings, 0 replies; 311+ messages in thread From: David Miller @ 2006-12-28 0:52 UTC (permalink / raw) To: torvalds Cc: schwidefsky, a.p.zijlstra, tbm, hugh, nickpiggin, arjan, andrei.popa, akpm, linux-kernel, fw, mh+linux-kernel, heiko.carstens, arnd.bergmann, gordonfarquharson From: Linus Torvalds <torvalds@osdl.org> Date: Wed, 27 Dec 2006 16:42:40 -0800 (PST) > That's fine. In that situation, you shouldn't need any atomic ops at all, > I think all our sw page-table operations are already done under the pte > lock. This is true, but there is one subtlety to this I want to point out in passing. That lock can possibly only protect a page of PTEs. When NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS, the locking is done per page of PTEs, not for all of the page tables of an address space at once. What this means is that it's really difficult to forcefully block out all page table operations for a given mm, and I actually needed to do something like this on sparc64 (when growing the TLB lookup hash table, you can't let any PTEs change state while the table is changing). For my case, I added a spinlock to mm->context since actually what I need is to block modifications to the hash table itself during PTE changes. ^ permalink raw reply [flat|nested] 311+ messages in thread
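The per-page locking granularity mentioned above comes down to simple address arithmetic. The sketch below assumes an x86-64-like geometry (4 KiB pages, 512 PTEs per page-table page); the toy_* names and constants are illustrative and are not the kernel's own helpers.

```c
#include <stdint.h>

/* Assumed geometry: 4 KiB pages, 512 PTEs per page-table page. */
#define TOY_PAGE_SHIFT   12u
#define TOY_PTRS_PER_PTE 512u
#define TOY_PTE_SPAN     ((uint64_t)TOY_PTRS_PER_PTE << TOY_PAGE_SHIFT) /* 2 MiB */

/* Split locks: one lock per page of PTEs, so every address inside the
 * same 2 MiB span shares a lock and neighbouring spans do not. */
static uint64_t toy_split_lock_index(uint64_t address)
{
	return address / TOY_PTE_SPAN;
}

/* No split locks: a single lock covers the whole address space. */
static uint64_t toy_mm_lock_index(uint64_t address)
{
	(void)address;
	return 0;
}

/* Returns nonzero when PTE updates for a and b serialize on one lock. */
static int toy_same_lock(uint64_t a, uint64_t b, int split)
{
	if (split)
		return toy_split_lock_index(a) == toy_split_lock_index(b);
	return toy_mm_lock_index(a) == toy_mm_lock_index(b);
}
```

Under these assumptions, addresses 0x1ff000 and 0x200000 are adjacent pages yet take different locks in the split case, which is why holding "the" pte lock does not stop PTE changes elsewhere in the same mm.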
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 22:15 ` Peter Zijlstra 2006-12-20 22:20 ` Peter Zijlstra 2006-12-20 22:49 ` Linus Torvalds @ 2006-12-21 2:36 ` Trond Myklebust 2006-12-21 8:10 ` Peter Zijlstra 2 siblings, 1 reply; 311+ messages in thread From: Trond Myklebust @ 2006-12-21 2:36 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Wed, 2006-12-20 at 23:15 +0100, Peter Zijlstra wrote: > I think this is also needed: NAK invalidate_inode_pages2() should _not_ be pretending that dirty pages are clean. This patch is incorrect both for the NFS usage and for the directIO usage. In the latter case, if someone has the page mmapped, resulting in the page getting marked as dirty _after_ a directIO write, then it would be wrong to discard that data. Only dirty data from _before_ the directIO write needs to be discarded (and that is achieved by unmapping, then cleaning the page prior to the directIO call)... For the NFS case, the race is a bit more tricky, since you have the "unstable write" case which means that the page is neither marked as dirty, nor is entirely clean ('cos we don't know that the server has committed the data to permanent storage yet).
Cheers Trond > --- > mm/truncate.c | 7 +------ > 1 file changed, 1 insertion(+), 6 deletions(-) > > Index: linux-2.6/mm/truncate.c > =================================================================== > --- linux-2.6.orig/mm/truncate.c > +++ linux-2.6/mm/truncate.c > @@ -320,19 +320,14 @@ invalidate_complete_page2(struct address > if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL)) > return 0; > > + cancel_dirty_page(page, PAGE_CACHE_SIZE); > lock_page_ref_irq(page); > - if (PageDirty(page)) > - goto failed; > - > BUG_ON(PagePrivate(page)); > __remove_from_page_cache(page); > unlock_page_ref_irq(page); > ClearPageUptodate(page); > page_cache_release(page); /* pagecache ref */ > return 1; > -failed: > - unlock_page_ref_irq(page); > - return 0; > } > > /** ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 2:36 ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Trond Myklebust @ 2006-12-21 8:10 ` Peter Zijlstra 0 siblings, 0 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-21 8:10 UTC (permalink / raw) To: Trond Myklebust Cc: Linus Torvalds, Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Wed, 2006-12-20 at 21:36 -0500, Trond Myklebust wrote: > On Wed, 2006-12-20 at 23:15 +0100, Peter Zijlstra wrote: > > I think this is also needed: > > NAK > > invalidate_inode_pages2() should _not_ be pretending that dirty pages > are clean. This patch is incorrect both for the NFS usage and for the > directIO usage. > > In the latter case, if someone has the page mmapped, resulting in the > page getting marked as dirty _after_ a directIO write, then it would be > wrong to discard that data. Only dirty data from _before_ the directIO > write should needs to be discarded (and that is achieved by unmapping, > then cleaning the page prior to the directIO call)... > > For the NFS case, the race is a bit more tricky, since you have the > "unstable write" case which means that the page is neither marked as > dirty, nor is entirely clean ('cos we don't know that the server has > committed the data to permanent storage yet). Then this patch: http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc1/2.6.20-rc1-mm1/broken-out/nfs-fix-nr_file_dirty-underflow.patch is equally wrong, right? ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 19:50 ` Linus Torvalds ` (2 preceding siblings ...) 2006-12-20 22:15 ` Peter Zijlstra @ 2006-12-20 23:24 ` David Chinner 2006-12-20 23:55 ` Linus Torvalds 2006-12-20 23:32 ` Andrew Morton ` (2 subsequent siblings) 6 siblings, 1 reply; 311+ messages in thread From: David Chinner @ 2006-12-20 23:24 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Wed, Dec 20, 2006 at 11:50:50AM -0800, Linus Torvalds wrote: > > > On Wed, 20 Dec 2006, Linus Torvalds wrote: > > > > So that's why I've been harping on the fact that I think we simply do > > really wrong things with PG_dirty at times [ ... ] > > Ok, I'll just put my money where my mouth is, and suggest a patch like > THIS instead. > > This one clears up all the issues I find irritating: > > - "test_clear_page_dirty()" is insane, both conceptually and as an > implementation. "Give me a 'C', give me an 'R', give me an 'A', give me > a 'P'". > > So rip out that mindfart entirely. > > - "clear_page_dirty()" is badly named, and should be about CANCELLING the > dirty bit, and must never be called with pages mapped anyway. So throw > that out too, and replace it with a new function: > > void cancel_dirty_page(struct page *page, unsigned int accounting_size); > > - "clear_page_dirty_for_io()" is fine. > > And with that, I then either rip out any old users of > "test_clear_page_dirty()" or "clear_page_dirty()", and if appropriate (and > it's realy lonly appropriate for "truncate()", I replace them with the new > "cancel_dirty_page()". Most of the time, they should just be deleted > entirely. > > NOTE NOTE NOTE! I _only_ did enough to make things compile for my > particular configuration. 
That means that right now the following > filesystems are broken with this patch (because they use the totally > broken old crap): > > CIFS, FUSE, JFS, ReiserFS, XFS XFS appears to call clear_page_dirty to get the mapping tree dirty tag set correctly at the same time the page dirty flag is cleared. I note that this can be done by set_page_writeback() if we clear the dirty flag on the page first when we are writing back the entire page. Hence it seems to me that the XFS call to clear_page_dirty() could easily be substituted by clear_page_dirty_for_io() followed by a call to set_page_writeback() to get the mapping tree tags set correctly after the page has been marked clean. Does this make sense (even without the posted patch)? --- fs/xfs/linux-2.6/xfs_aops.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c =================================================================== --- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c 2006-12-19 12:22:47.000000000 +1100 +++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c 2006-12-21 10:15:04.545375877 +1100 @@ -340,9 +340,9 @@ xfs_start_page_writeback( { ASSERT(PageLocked(page)); ASSERT(!PageWriteback(page)); - set_page_writeback(page); if (clear_dirty) - clear_page_dirty(page); + clear_page_dirty_for_io(page); + set_page_writeback(page); unlock_page(page); if (!buffers) { end_page_writeback(page); Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 23:24 ` David Chinner @ 2006-12-20 23:55 ` Linus Torvalds 2006-12-21 1:20 ` David Chinner 0 siblings, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-20 23:55 UTC (permalink / raw) To: David Chinner Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Thu, 21 Dec 2006, David Chinner wrote: > > XFS appears to call clear_page_dirty to get the mapping tree dirty > tag set correctly at the same time the page dirty flag is cleared. I > note that this can be done by set_page_writeback() if we clear the > dirty flag on the page first when we are writing back the entire page. Yes. I think the XFS routine should just use "clear_page_dirty_for_io()", since that matches what it actually wants to do (surprise surprise, it's going to write it out). HOWEVER. Why is it conditional? Can somebody who understands XFS tell me why "clear_dirty" would ever be 0? I can grep the sources, and I see that it's an unconditional 1 in one call-site, but then in the other one it does xfs_start_page_writeback(page, wbc, !page_dirty, count); and that part just blows my mind. Why would you do a xfs_start_page_writeback() and _not_ write the page out? Is this for a partial-page-only case? Anyway, your patch looks fine. It seems to be the right thing to do. I'm just wondering why we're not always cleaning the whole page, and why we'd not set it unconditionally dirty? Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 23:55 ` Linus Torvalds @ 2006-12-21 1:20 ` David Chinner 0 siblings, 0 replies; 311+ messages in thread From: David Chinner @ 2006-12-21 1:20 UTC (permalink / raw) To: Linus Torvalds Cc: David Chinner, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Wed, Dec 20, 2006 at 03:55:25PM -0800, Linus Torvalds wrote: > On Thu, 21 Dec 2006, David Chinner wrote: > > > > XFS appears to call clear_page_dirty to get the mapping tree dirty > > tag set correctly at the same time the page dirty flag is cleared. I > > note that this can be done by set_page_writeback() if we clear the > > dirty flag on the page first when we are writing back the entire page. > > Yes. I think the XFS routine should just use "clear_page_dirty_for_io()", > since that matches what it actually wants to do (surprise surprise, it's > going to write it out). Yup ;) > HOWEVER. Why is it conditional? Can somebody who understands XFS tell me > why "clear_dirty" would ever be 0? I can grep the sources, and I see that > it's an unconditional 1 in one call-site, but then in the other one it > does > > xfs_start_page_writeback(page, wbc, !page_dirty, count); page dirty starts at the number of dirty buffers on the page, and as we map each dirty buffer into the I/O we decrement the page dirty count. Hence if we map all of the buffers into the I/O, we are cleaning the entire page and hence we can clear the dirty state on the page. > and that part just blows my mind. Why would you do a > xfs_start_page_writeback() and _not_ write the page out? Is this for a > partial-page-only case? Yes, partial-page-only case when doing speculative write clustering. 
We'll hit this when an extent boundary is not page aligned (fs block size < page size case) and we need to issue at least two separate I/Os to clean the page. Because we leave the page dirty and we are working ahead of the index in generic_writepages() we'll get the rest of the page flushed when we return back to generic_writepages() as the page is still dirty in the mapping tree.... > Anyway, your patch looks fine. It seems to be the right thing to do. Ok, thanks, Linus. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 311+ messages in thread
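[Annotation, not part of the original message.] The page_dirty counting that Dave describes can be sketched in user space roughly as follows. This is only a toy model under stated assumptions: `model_page`, `model_buffer`, and all `model_*` functions are hypothetical names invented here, not XFS or kernel APIs, and the "extent" is reduced to an integer id. It shows why `clear_dirty` is `!page_dirty`: the page-level dirty flag is only cleared once every dirty buffer on the page has been mapped into an I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these are kernel types. */
struct model_buffer {
    bool dirty;
    int extent;               /* id of the extent this block belongs to */
};

struct model_page {
    bool dirty;
    bool writeback;
    size_t nbuf;              /* number of buffers in use, at most 8 */
    struct model_buffer buf[8];
};

/*
 * Map every dirty buffer belonging to `extent` into one I/O.
 * Returns the number of dirty buffers left unmapped, which plays
 * the role of page_dirty in the thread.
 */
static int model_map_extent(struct model_page *page, int extent)
{
    int page_dirty = 0;

    for (size_t i = 0; i < page->nbuf; i++)
        if (page->buf[i].dirty)
            page_dirty++;

    for (size_t i = 0; i < page->nbuf; i++) {
        if (page->buf[i].dirty && page->buf[i].extent == extent) {
            page->buf[i].dirty = false;   /* buffer goes into the I/O */
            page_dirty--;
        }
    }
    return page_dirty;
}

/* Same ordering as the proposed patch: clean first, then mark writeback. */
static void model_start_page_writeback(struct model_page *page, bool clear_dirty)
{
    assert(!page->writeback);
    if (clear_dirty)
        page->dirty = false;  /* stands in for clear_page_dirty_for_io() */
    page->writeback = true;   /* stands in for set_page_writeback() */
}

/* Write one extent's worth of the page; returns whether the page stays dirty. */
static bool model_write_extent(struct model_page *page, int extent)
{
    int page_dirty = model_map_extent(page, extent);

    model_start_page_writeback(page, !page_dirty);
    page->writeback = false;  /* I/O completion, i.e. end_page_writeback() */
    return page->dirty;
}
```

With a page whose four buffers straddle two extents, the first `model_write_extent()` call leaves the page dirty (two buffers remain), and the second call cleans it, which mirrors the "at least two separate I/Os" case described above.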
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 19:50 ` Linus Torvalds ` (3 preceding siblings ...) 2006-12-20 23:24 ` David Chinner @ 2006-12-20 23:32 ` Andrew Morton 2006-12-20 23:55 ` Linus Torvalds 2006-12-21 7:32 ` Gordon Farquharson 2006-12-21 11:21 ` Martin Michlmayr 6 siblings, 1 reply; 311+ messages in thread From: Andrew Morton @ 2006-12-20 23:32 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson, Chen, Kenneth W On Wed, 20 Dec 2006 11:50:50 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > Ok, I'll just put my money where my mouth is, and suggest a patch like > THIS instead. > > ... > > diff --git a/fs/buffer.c b/fs/buffer.c > index d1f1b54..263f88e 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page) > int ret = 0; > > BUG_ON(!PageLocked(page)); > - if (PageWriteback(page)) > + if (PageDirty(page) || PageWriteback(page)) > return 0; > > if (mapping == NULL) { /* can this still happen? */ > @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page) > spin_lock(&mapping->private_lock); > ret = drop_buffers(page, &buffers_to_free); > spin_unlock(&mapping->private_lock); > - if (ret) { > - /* > - * If the filesystem writes its buffers by hand (eg ext3) > - * then we can have clean buffers against a dirty page. We > - * clean the page here; otherwise later reattachment of buffers > - * could encounter a non-uptodate page, which is unresolvable. > - * This only applies in the rare case where try_to_free_buffers > - * succeeds but the page is not freed. > - * > - * Also, during truncate, discard_buffer will have marked all > - * the page's buffers clean. We discover that here and clean > - * the page also. 
> - */ > - if (test_clear_page_dirty(page)) > - task_io_account_cancelled_write(PAGE_CACHE_SIZE); > - } I think this will be OK, because vmscan has just run ->writepage anyway. But we will need to make changes to truncate_complete_page() - make it run cancel_dirty_page() before it runs do_invalidatepage(). I don't think there's anything preventing zap_pte_range() or perhaps a pagefault from coming in and dirtying this page after we've tested PageDirty(). That could leave us with a dirty, non-uptodate page with no buffers, which is very bad. But this situation is hopefully impossible, because if the page is not uptodate then the first thing a pagefault will do is bring it uptodate, which involves locking it. And if zap_pte_range() is looking at this page, it is uptodate. If the page _was_ uptodate and the zap_pte_range() race happens, we'll end up with with either a dirty page with dirty buffers or a dirty uptodate page with no buffers, both of which are OK. > +void cancel_dirty_page(struct page *page, unsigned int account_size) > +{ > + /* If we're cancelling the page, it had better not be mapped any more */ > + if (page_mapped(page)) { > + static unsigned int warncount; > + > + WARN_ON(++warncount < 5); > + } > + > + if (TestClearPageDirty(page) && account_size) > + task_io_account_cancelled_write(account_size); > +} This doesn't clear the radix-tree dirty tags. I'm not sure what effect that would have on a truncated mapping. Perhaps just a bit of extra work in radix-tree lookup during writeback. If we _know_ that this page is about to be removed from pagecache then radix_tree_delete() will delete the tags for us anyway, but invalidate_inode_pages2() can decide to back out. 
> @@ -386,12 +399,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping, > PAGE_CACHE_SIZE, 0); > } > } > - was_dirty = test_clear_page_dirty(page); > - if (!invalidate_complete_page2(mapping, page)) { > - if (was_dirty) > - set_page_dirty(page); > + if (!invalidate_complete_page2(mapping, page)) > ret = -EIO; > - } > unlock_page(page); Well, it used to. invalidate_complete_page2() is pretty gruesome. We're handling the case where someone went and redirtied the page (and hence its buffers) after the invalidate_inode_pages2() caller (generic_file_direct_IO) synced it to disk. I'd prefer to just fail the direct-io if someone did that, but then people's tests fail and they whine. It's tempting to just truncate the damn page and discard the user's data - the app is being silly. But that would permit access to uninitialised disk blocks. With your change I think what'll happen is that we'll correctly handle the case where the page and its buffers are dirty (it gets left in place), but we'll needlessy fail in the case where the page is dirty but the buffers are clean. How important that will be in practice I do not know. People will get -EIOs where they used not to. A suitable fix for that might to be to simply not return -EIO here. So some thread went and dirtied a pagecache page after generic_file_direct_IO() synced the data. Big deal, that's your own fault. Usually the disk will end up getting a copy of the dirtied pagecache page and rarely it'll get a copy of the direct-io-written page. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 23:32 ` Andrew Morton @ 2006-12-20 23:55 ` Linus Torvalds 2006-12-21 0:11 ` Andrew Morton 2006-12-21 2:54 ` Trond Myklebust 0 siblings, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-20 23:55 UTC (permalink / raw) To: Andrew Morton Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson, Chen, Kenneth W On Wed, 20 Dec 2006, Andrew Morton wrote: > > > +void cancel_dirty_page(struct page *page, unsigned int account_size) > > +{ > > + /* If we're cancelling the page, it had better not be mapped any more */ > > + if (page_mapped(page)) { > > + static unsigned int warncount; > > + > > + WARN_ON(++warncount < 5); > > + } > > + > > + if (TestClearPageDirty(page) && account_size) > > + task_io_account_cancelled_write(account_size); > > +} > > This doesn't clear the radix-tree dirty tags. I'm not sure what effect > that would have on a truncated mapping. Perhaps just a bit of extra work > in radix-tree lookup during writeback. This should _only_ be a valid thing to do when we're removing the page from a mapping anyway, so I'd most definitely hope that the code immediately after (or before) will have done a "remove_from_page_cache()" In which case the tags should not matter. There is _no_ excuse for cancelling a page and leaving it in the page cache that I can see. Because your page contents will be _indeterminate_. > > @@ -386,12 +399,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping, > > invalidate_complete_page2() is pretty gruesome. We're handling the case > where someone went and redirtied the page (and hence its buffers) after the > invalidate_inode_pages2() caller (generic_file_direct_IO) synced it to > disk. 
> > I'd prefer to just fail the direct-io if someone did that, but then > people's tests fail and they whine. So with my change, afaik, we will just return EIO to the invalidate, and do the write. Which should be ok. In fact, it appears to be the only possibly valid thing to do. It really boils down to that same thing: if you remove the dirty bit, there is NO CONCEIVABLE GOOD THING YOU CAN DO EXCEPT FOR: - do the damn IO already ("clear_page_dirty_for_io()") - truncate the page (unmap and destroy it both from page cache AND from any user-visible filesystem cases) Anything else is simply a bug. Always has been. My patch just makes that very clear. > With your change I think what'll happen is that we'll correctly handle the > case where the page and its buffers are dirty (it gets left in place), but > we'll needlessly fail in the case where the page is dirty but the buffers > are clean. How important that will be in practice I do not know. People > will get -EIOs where they used not to. People will now get -EIO where they used to get an inconsistent system image. I really think it sounds like an improvement. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 23:55 ` Linus Torvalds @ 2006-12-21 0:11 ` Andrew Morton 2006-12-21 0:22 ` Linus Torvalds 2006-12-21 2:54 ` Trond Myklebust 1 sibling, 1 reply; 311+ messages in thread From: Andrew Morton @ 2006-12-21 0:11 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson, Chen, Kenneth W On Wed, 20 Dec 2006 15:55:14 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > > > @@ -386,12 +399,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping, > > > > invalidate_complete_page2() is pretty gruesome. We're handling the case > > where someone went and redirtied the page (and hence its buffers) after the > > invalidate_inode_pages2() caller (generic_file_direct_IO) synced it to > > disk. > > > > I'd prefer to just fail the direct-io if someone did that, but then > > people's tests fail and they whine. > > So with my change, afaik, we will just return EIO to the invalidate, and > do the write. The write's already been done by this stage. > Which should be ok. In fact, it appears to be the only > possibly valid thing to do. > > It really boils down to that same thing: if you remove the dirty bit, > there is NO CONCEIVABLE GOOD THING YOU CAN DO EXCEPT FOR: > - do the damn IO already ("clear_page_dirty_for_io()") > - truncate the page (unmap and destroy it both from page cache AND from > any user-visible filesystem cases) There's also redirty_page_for_writepage(). ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 0:11 ` Andrew Morton @ 2006-12-21 0:22 ` Linus Torvalds 2006-12-21 0:24 ` Linus Torvalds 2006-12-21 0:43 ` Linus Torvalds 0 siblings, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-21 0:22 UTC (permalink / raw) To: Andrew Morton Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson, Chen, Kenneth W On Wed, 20 Dec 2006, Andrew Morton wrote: > > > > So with my change, afaik, we will just return EIO to the invalidate, and > > do the write. > > The write's already been done by this stage. Ok, but the end result is the same: you MUST NOT just "cancel" a write. It needs to be done, or the backing store must be actually de-allocated. You can't just say "get rid of it" and think that it can work. Exactly because of security issues, and just the simple fact that reading it back gets random contents. So I repeat: clearing a dirty bit really only has two valid cases. Not three, like we used to have. And the "cancel" case cannot be conditional: either you can cancel it or you cannot. There's no if (cancel_dirty_page()) { .. sequence that makes sense that I can think of. > > It really boils down to that same thing: if you remove the dirty bit, > > there is NO CONCEIVABLE GOOD THING YOU CAN DO EXCEPT FOR: > > - do the damn IO already ("clear_page_dirty_for_io()") > > - truncate the page (unmap and destroy it both from page cache AND from > > any user-visible filesystem cases) > > There's also redirty_page_for_writepage(). _dirtying_ a page makes sense in any situation. You can always dirty them. I'm just saying that you can't just mark them *clean*. If your point was that the filesystem had better be able to take care of "redirty_page_for_writepage()", then yes, of course. 
But since it's the filesystem itself that does it, it had _better_ be able to take care of the situation it puts itself into. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 0:22 ` Linus Torvalds @ 2006-12-21 0:24 ` Linus Torvalds 2006-12-21 15:48 ` Andrei Popa 2006-12-21 0:43 ` Linus Torvalds 1 sibling, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-21 0:24 UTC (permalink / raw) To: Andrew Morton Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson, Chen, Kenneth W Btw, I'd really love to hear whether the patch I sent out actually _helps_ at all, or whether we're just discussing something that in the end is just a cleanup.. Martin, Peter, Andrei, pls give it a try. (Martin and Andrei may be talking about different bugs, so _both_ of your experiences definitely matter here). Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 0:24 ` Linus Torvalds @ 2006-12-21 15:48 ` Andrei Popa 2006-12-21 16:58 ` Linus Torvalds 0 siblings, 1 reply; 311+ messages in thread From: Andrei Popa @ 2006-12-21 15:48 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson, Chen, Kenneth W On Wed, 2006-12-20 at 16:24 -0800, Linus Torvalds wrote: > > Btw, I'd really love to hear whether the patch I sent out actually _helps_ > at all, or whether we're just discussing something that in the end is just > a cleanup.. > > Martin, Peter, Andrei, pls give it a try. (Martin and Andrei may be > talking about different bugs, so _both_ of your experiences definitely > matter here). with http://lkml.org/lkml/diff/2006/12/20/204/1 I have corruption: Hash check on download completion found bad chunks, consider using "safe_sync". > > Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 15:48 ` Andrei Popa @ 2006-12-21 16:58 ` Linus Torvalds 0 siblings, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-21 16:58 UTC (permalink / raw) To: Andrei Popa Cc: Andrew Morton, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson, Chen, Kenneth W On Thu, 21 Dec 2006, Andrei Popa wrote: > On Wed, 2006-12-20 at 16:24 -0800, Linus Torvalds wrote: > > > > Martin, Peter, Andrei, pls give it a try. (Martin and Andrei may be > > talking about different bugs, so _both_ of your experiences definitely > > matter here). > > with http://lkml.org/lkml/diff/2006/12/20/204/1 > I have corruption: Hash check on download completion found bad chunks, > consider using "safe_sync". Gaah. Martin Michlmayr reported that it apparently fixes his ARM corruption. Now, admittedly I already suspected the issues might be different (if only because of the UP vs SMP/PREEMPT case), but I really had my hopes up after Martin's report, because if anything, _his_ issue might have been a superset of your problem (while obviously any subtle SMP races you might be seeing are definitely not an issue in his case). Oh well. I think the ARM case is enough of a reason to apply those patches (if it hadn't made any difference at all, I'd have waited until after 2.6.20), and we'll just have to continue on the SMP PREEMPT angle. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 0:22 ` Linus Torvalds 2006-12-21 0:24 ` Linus Torvalds @ 2006-12-21 0:43 ` Linus Torvalds 2006-12-21 1:20 ` Andrew Morton 1 sibling, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-21 0:43 UTC (permalink / raw) To: Andrew Morton Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson, Chen, Kenneth W On Wed, 20 Dec 2006, Linus Torvalds wrote: > > > > There's also redirty_page_for_writepage(). > > _dirtying_ a page makes sense in any situation. You can always dirty them. > I'm just saying that you can't just mark them *clean*. > > If your point was that the filesystem had better be able to take care of > "redirty_page_for_writepage()", then yes, of course. But since it's the > filesystem itself that does it, it had _better_ be able to take care of > the situation it puts itself into. Btw, as an example of something where this may NOT be ok, look at migrate_page_copy(). I'm not at all convinced that "migrate_page_copy()" can work at all. It does: ... if (PageDirty(page)) { clear_page_dirty_for_io(page); set_page_dirty(newpage); } ... which is an example of what NOT to do, because it claims to clear the page for IO, but doesn't actually _do_ any IO. And this is wrong, for many reasons. For example, it's very possible that the old page is not actually up-to-date, and is only partially dirty using some FS-specific dirty data queues (like NFS does with its dirty data, or buffer-heads can do for local filesystems). When you do if (clear_dirty(page)) set_page_dirty(page); in generic VM code, that is a BUG. It's an insane operation. It cannot work. It's exactly what I'm trying to avoid. So page migration is probably broken, but it's no less broken than it always has been. 
And I don't think many people use it anyway. It might work "by accident" in a lot of situations, but to actually be solid, it really would need to do something fundamentally different, like: - have a per-mapping "migrate()" function that actually knows HOW to migrate the dirty state from one page to another. - or, preferably, by just not migrating dirty pages, and just actually doing the writeback on them first. Again, this is an example of just _incorrect_ code, that thinks that it can "just clear the dirty bit". You can't do that. It's wrong. And it is not wrong just because I say so, but because the operations itself simply is FUNDAMENTALLY not a sensible one. This is why I keep harping on this issue: there are two cases, and two cases only, when you can clear a page. And no, "migrating the data to another page" was not one of those two cases. The cases are, and will _always_ be: (a) full writeback IO of _all_ the dirty data on the page (and that can only be done by the low-level filesystem, since it's the only one that knows what rules it has followed for marking things dirty) and (b) cancelling dirty data that got truncated and literally removed from the filesystem. So I don't claim that I fixed all the cases. mm/migrate.c is still broken. Maybe somebody else also uses "clear_page_dirty_for_io()" even though the name very clearly says FOR IO. I didn't check, but I think they're mostly right now. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
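[Annotation, not part of the original message.] The rule Linus states, that a dirty bit may only be cleared (a) as part of writing the data out or (b) when the page is truncated away, can be illustrated with a small user-space model. Everything here is an assumption made for illustration: `toy_page` and the `toy_*` functions are hypothetical and only borrow their naming from `clear_page_dirty_for_io()` and `cancel_dirty_page()`. Unlike the posted patch, which merely warns when the page is still mapped, the model asserts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page cache page; not a kernel type. */
struct toy_page {
    bool dirty;
    bool mapped;          /* still mapped into some page table */
    bool in_page_cache;
    unsigned int writes;  /* how many times the data reached "disk" */
};

/* Case (a), step 1: clear the bit only because I/O is about to happen. */
static bool toy_clear_dirty_for_io(struct toy_page *page)
{
    bool was_dirty = page->dirty;

    page->dirty = false;
    return was_dirty;
}

/* Case (a), step 2: the caller really does the write it promised. */
static void toy_writepage(struct toy_page *page)
{
    if (toy_clear_dirty_for_io(page))
        page->writes++;
}

/* Case (b), step 1: cancel dirty data, valid only once the page is unmapped. */
static bool toy_cancel_dirty(struct toy_page *page)
{
    bool was_dirty = page->dirty;

    assert(!page->mapped);    /* the real patch only WARNs here */
    page->dirty = false;
    return was_dirty;
}

/* Case (b), step 2: cancelling is always followed by removal from the cache. */
static void toy_truncate_page(struct toy_page *page)
{
    page->mapped = false;     /* unmap from user space first */
    toy_cancel_dirty(page);
    page->in_page_cache = false;
}
```

The point of the model is what it does not offer: there is no function that clears `dirty` and leaves a page in the cache without its data having been written, which is the third case the thread argues should not exist.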
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 0:43 ` Linus Torvalds @ 2006-12-21 1:20 ` Andrew Morton 0 siblings, 0 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-21 1:20 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson, Chen, Kenneth W, Christoph Lameter On Wed, 20 Dec 2006 16:43:31 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > > > On Wed, 20 Dec 2006, Linus Torvalds wrote: > > > > > > There's also redirty_page_for_writepage(). > > > > _dirtying_ a page makes sense in any situation. You can always dirty them. > > I'm just saying that you can't just mark them *clean*. > > > > If your point was that the filesystem had better be able to take care of > > "redirty_page_for_writepage()", then yes, of course. But since it's the > > filesystem itself that does it, it had _better_ be able to take care of > > the situation it puts itself into. > > Btw, as an example of something where this may NOT be ok, look at > migrate_page_copy(). > > I'm not at all convinced that "migrate_page_copy()" can work at all. It > does: > > ... > if (PageDirty(page)) { > clear_page_dirty_for_io(page); > set_page_dirty(newpage); Note that this is referring to different pages. > } > ... > > which is an example of what NOT to do, because it claims to clear the page > for IO, but doesn't actually _do_ any IO. > > And this is wrong, for many reasons. > > For example, it's very possible that the old page is not actually > up-to-date, and is only partially dirty using some FS-specific dirty data > queues (like NFS does with its dirty data, or buffer-heads can do for > local filesystems). afaict the code copes with those things. 
> When you do > > if (clear_dirty(page)) > set_page_dirty(page); > > in generic VM code, that is a BUG. It's an insane operation. It cannot > work. It's exactly what I'm trying to avoid. These are different pages. We could view the copy_highpage() in migrate_page_copy() as an "io" operation, only the backing store is a new pagecache page. It'd be more logical if that copy_highpage() was occurring after the clear_page_dirty_for_io(). I'm not sure why migrate_page_copy() is playing with PageWriteback(newpage). Surely newpage is locked, in which case nobody will be starting writeback on it. > So page migration is probably broken, but it's no less broken than it > always has been. And I don't think many people use it anyway. It might > work "by accident" in a lot of situations, but to actually be solid, it > really would need to do something fundamentally different, like: > > - have a per-mapping "migrate()" function that actually knows HOW to > migrate the dirty state from one page to another. That is how it's presently implemented. You're looking at helper functions which fileystems may point their address_space_operations.migratepage at. > - or, preferably, by just not migrating dirty pages, and just actually > doing the writeback on them first. > > Again, this is an example of just _incorrect_ code, that thinks that it > can "just clear the dirty bit". You can't do that. It's wrong. And it is > not wrong just because I say so, but because the operations itself simply > is FUNDAMENTALLY not a sensible one. The dirty state is being transferred to the new page. The tricky part is handling the cases where these pages are mapped into pagetables. That's what the special migration ptes are there for. I'll let Christoph explain that lot ;) ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 23:55 ` Linus Torvalds 2006-12-21 0:11 ` Andrew Morton @ 2006-12-21 2:54 ` Trond Myklebust 2006-12-21 17:19 ` Linus Torvalds 1 sibling, 1 reply; 311+ messages in thread From: Trond Myklebust @ 2006-12-21 2:54 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson, Chen, Kenneth W On Wed, 2006-12-20 at 15:55 -0800, Linus Torvalds wrote: > > With your change I think what'll happen is that we'll correctly handle the > > case where the page and its buffers are dirty (it gets left in place), but > > we'll needlessly fail in the case where the page is dirty but the buffers > > are clean. How important that will be in practice I do not know. People > > will get -EIOs where they used not to. > > People will now get -EIO where they used to get an inconsistent system > image. I really think it sounds like an improvement. The hell it is. You end up with a corrupted page cache because invalidate_inode_pages2_range() immediately exits without throwing out the pages in the rest of the range. I can't see that it is the business of invalidate_inode_pages2() to resolve races between ->direct_IO() and pages that are redirtied by mmap(). All it needs to ensure is that pages that are clean are discarded, since those are neither consistent with data that the ->directIO() call wrote to the disk nor are they scheduled to be written to disk. The only case that I can see that is still problematic is NFS because it may have unstable writes (hence the ->launder_page() patch that I posted yesterday). Trond ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 2:54 ` Trond Myklebust @ 2006-12-21 17:19 ` Linus Torvalds 0 siblings, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-21 17:19 UTC (permalink / raw) To: Trond Myklebust Cc: Andrew Morton, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson, Chen, Kenneth W On Wed, 20 Dec 2006, Trond Myklebust wrote: > > I can't see that it is the business of invalidate_inode_pages2() to > resolve races between ->direct_IO() and pages that are redirtied by > mmap(). All it needs to ensure is that pages that are clean are discarded, > since those are neither consistent with data that the ->directIO() call > wrote to the disk nor are they scheduled to be written to disk. Sure, we could happily just remove the -EIO. Alternatively, we could still do all the invalidates over the whole range, and return -EIO at the end if any of the pages weren't invalidated because they had to be written back. I don't personally care whether we should just return success or something to indicate that there were busy pages, but somebody who _uses_ direct-IO might want to know that the thing didn't throw away everything. If you know such users, can you ask them? (Maybe "-EAGAIN" is better than "-EIO", since it's not really even a fatal error). Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
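The two behaviours discussed in the message above (stop at the first page that cannot be invalidated, versus walk the whole range and report the failure only at the end) can be contrasted with a small user-space toy model. This is only a sketch and not kernel code: struct toy_page, toy_invalidate_one(), invalidate_range_bail(), invalidate_range_all() and toy_count_cached() are hypothetical names invented for illustration, and a "dirty" flag stands in for "had to be written back".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page-cache page; NOT the kernel's struct page. */
struct toy_page {
    bool dirty;   /* still holds data that would need writeback */
    bool cached;  /* still present in the (toy) page cache */
};

/* Hypothetical helper: drop a clean page, refuse to drop a dirty one. */
static bool toy_invalidate_one(struct toy_page *p)
{
    if (p->dirty)
        return false;
    p->cached = false;
    return true;
}

/* Variant A: give up at the first failure; later pages are never visited. */
static int invalidate_range_bail(struct toy_page *pages, size_t n)
{
    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (!toy_invalidate_one(&pages[i]))
            return -EIO;
    return 0;
}

/*
 * Variant B: visit every page, remember that something failed, and
 * report it once at the end.  -EIO is used here; the thread also
 * floats -EAGAIN as a possibly better choice.
 */
static int invalidate_range_all(struct toy_page *pages, size_t n)
{
    int ret = 0;

    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (!toy_invalidate_one(&pages[i]))
            ret = -EIO;
    return ret;
}

/* Count how many pages are still sitting in the toy cache. */
static size_t toy_count_cached(const struct toy_page *pages, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (pages[i].cached)
            count++;
    return count;
}
```

With a range of clean, dirty, clean pages, variant A leaves the trailing clean page in the cache while variant B removes it; both still report an error to the caller.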
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 19:50 ` Linus Torvalds ` (4 preceding siblings ...) 2006-12-20 23:32 ` Andrew Morton @ 2006-12-21 7:32 ` Gordon Farquharson 2006-12-21 7:53 ` Linus Torvalds 2006-12-21 11:21 ` Martin Michlmayr 6 siblings, 1 reply; 311+ messages in thread From: Gordon Farquharson @ 2006-12-21 7:32 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann On 12/20/06, Linus Torvalds <torvalds@osdl.org> wrote: > Ok, I'll just put my money where my mouth is, and suggest a patch like > THIS instead. > Martin, Andrei, does this make any difference for your corruption cases? Unfortunately, I cannot get the latest git version of the kernel to boot on the ARM machine on which Martin and I are experiencing the apt segfault. After the kernel is finished uncompressing it prints "done, booting the kernel." as expected, but nothing more happens. I have tried both with and without the patch. Hopefully either Andrei or Martin will have better luck at testing this patch than I have had. Gordon -- Gordon Farquharson ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 7:32 ` Gordon Farquharson @ 2006-12-21 7:53 ` Linus Torvalds 2006-12-21 8:38 ` Martin Michlmayr ` (2 more replies) 0 siblings, 3 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-21 7:53 UTC (permalink / raw) To: Gordon Farquharson Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann On Thu, 21 Dec 2006, Gordon Farquharson wrote: > > Unfortunately, I cannot get the latest git version of the kernel to > boot on the ARM machine on which Martin and I are experiencing the apt > segfault. Ouch. > After the kernel is finished uncompressing it prints "done, > booting the kernel." as expected, but nothing more happens. I have > tried both with and without the patch. Hopefully either Andrei or > Martin will have better luck at testing this patch than I have had. That's obviously a bug worth fixing on its own. Do you know when it started? That said, I think the patch I sent out should actually work on top of plain 2.6.19 too. I don't think things have changed in this area that much. IOW, you don't _need_ latest -git to test it, you just need a broken kernel ;) Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 7:53 ` Linus Torvalds @ 2006-12-21 8:38 ` Martin Michlmayr 2006-12-21 8:59 ` Linus Torvalds 2006-12-21 9:17 ` Gordon Farquharson 2006-12-21 12:30 ` Russell King 2 siblings, 1 reply; 311+ messages in thread From: Martin Michlmayr @ 2006-12-21 8:38 UTC (permalink / raw) To: Linus Torvalds Cc: Gordon Farquharson, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann * Linus Torvalds <torvalds@osdl.org> [2006-12-20 23:53]: > > Unfortunately, I cannot get the latest git version of the kernel to > > boot on the ARM machine on which Martin and I are experiencing the apt > > segfault. > > Ouch. > > That's obviously a bug worth fixing on its own. Do you know when it > started? This is a known issue. The following patch has been proposed http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=4030/1 although I just noticed that it has been marked as "discarded". Apparently Russell King committed a better patch so this should be fixed in git when he sends his next pull request. -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 8:38 ` Martin Michlmayr @ 2006-12-21 8:59 ` Linus Torvalds 0 siblings, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-21 8:59 UTC (permalink / raw) To: Martin Michlmayr Cc: Gordon Farquharson, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann On Thu, 21 Dec 2006, Martin Michlmayr wrote: > > This is a known issue. The following patch has been proposed > http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=4030/1 > although I just noticed that it has been marked as "discarded". > Apparently Russell King committed a better patch so this should be > fixed in git when he sends his next pull request. Ahh, ok. Then it might even be in the set of merges I did earlier today (and which should mirror out soon enough, hopefully). Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 7:53 ` Linus Torvalds 2006-12-21 8:38 ` Martin Michlmayr @ 2006-12-21 9:17 ` Gordon Farquharson 2006-12-21 9:27 ` Andrew Morton 2006-12-21 12:30 ` Russell King 2 siblings, 1 reply; 311+ messages in thread From: Gordon Farquharson @ 2006-12-21 9:17 UTC (permalink / raw) To: Linus Torvalds Cc: Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann On 12/21/06, Linus Torvalds <torvalds@osdl.org> wrote: > That said, I think the patch I sent out should actually work on top of > plain 2.6.19 too. I don't think things have changed in this area that > much. IOW, you don't _need_ latest -git to test it, you just need a broken > kernel ;) I created a version of your patch that applied to 2.6.19, but it doesn't compile: mm/built-in.o: In function `cancel_dirty_page': slab.c:(.text+0x8964): undefined reference to `task_io_account_cancelled_write' make[3]: *** [.tmp_vmlinux1] Error 1 It looks like task_io_account_cancelled_write() was added in http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7c3ab7381e79dfc7db14a67c6f4f3285664e1ec2 Can the call to task_io_account_cancelled_write() simply be removed from cancel_dirty_page() for testing the patch with 2.6.19 (since 2.6.19 doesn't seem to have the task I/O accounting) ? Gordon -- Gordon Farquharson ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 9:17 ` Gordon Farquharson @ 2006-12-21 9:27 ` Andrew Morton 2006-12-22 4:20 ` Gordon Farquharson 0 siblings, 1 reply; 311+ messages in thread From: Andrew Morton @ 2006-12-21 9:27 UTC (permalink / raw) To: Gordon Farquharson Cc: Linus Torvalds, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann On Thu, 21 Dec 2006 02:17:05 -0700 "Gordon Farquharson" <gordonfarquharson@gmail.com> wrote: > Can the call to task_io_account_cancelled_write() simply be removed > from cancel_dirty_page() for testing the patch with 2.6.19 (since > 2.6.19 doesn't seem to have the task I/O accounting) ? Yes. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 9:27 ` Andrew Morton @ 2006-12-22 4:20 ` Gordon Farquharson 2006-12-22 4:54 ` Linus Torvalds 2006-12-22 10:01 ` Martin Michlmayr 0 siblings, 2 replies; 311+ messages in thread From: Gordon Farquharson @ 2006-12-22 4:20 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann On 12/21/06, Andrew Morton <akpm@osdl.org> wrote: > > Can the call to task_io_account_cancelled_write() simply be removed > > from cancel_dirty_page() for testing the patch with 2.6.19 (since > > 2.6.19 doesn't seem to have the task I/O accounting) ? > > Yes. I tested 2.6.19 with a version of Linus's patch that applies cleanly to 2.6.19 (patch appended to the end of this email) on ARM and apt-get failed. It did not segfault this time, but instead got stuck for about 20 to 30 minutes and was accessing the hard drive frequently. Here is some background about the problem we see with apt which may help somebody with knowledge of the apt source code analyse the problem in the context of the patch. When apt-get is first run, it generates pkgcache.bin and srcpkgcache.bin in /var/cache/apt. We have found that these are the files that get corrupted when we apply the patch "mm: tracking shared dirty pages" [1] to 2.6.18. The corruption of these files is what causes apt-get to segfault. I have observed that the normal operation of apt-get is that while apt-get is generating these files, pkgcache.bin grows to 12582912 bytes, and when apt-get finishes, pkgcache.bin is 6425533 bytes and srcpkgcache.bin is 64254483 bytes. This time, when apt-get exited, it had only created pkgcache.bin which was still 12582912 bytes. Also, the patch caused apt to slow down a lot. 
I ran apt-get -f install after apt had exited, and it took so long that I killed it before it had finished. I did not try 2.6.20-git, but I presume that this version is what Martin tried earlier. Maybe Linus's patch doesn't work with 2.6.19, because 2.6.19 is missing some other patch. Gordon [1] http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89 diff -Naupr linux-2.6.19.orig/fs/buffer.c linux-2.6.19/fs/buffer.c --- linux-2.6.19.orig/fs/buffer.c 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/fs/buffer.c 2006-12-21 01:16:31.000000000 -0700 @@ -2832,7 +2832,7 @@ int try_to_free_buffers(struct page *pag int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? */ @@ -2843,17 +2843,6 @@ int try_to_free_buffers(struct page *pag spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* - * If the filesystem writes its buffers by hand (eg ext3) - * then we can have clean buffers against a dirty page. We - * clean the page here; otherwise later reattachment of buffers - * could encounter a non-uptodate page, which is unresolvable. - * This only applies in the rare case where try_to_free_buffers - * succeeds but the page is not freed. - */ - clear_page_dirty(page); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff -Naupr linux-2.6.19.orig/fs/hugetlbfs/inode.c linux-2.6.19/fs/hugetlbfs/inode.c --- linux-2.6.19.orig/fs/hugetlbfs/inode.c 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/fs/hugetlbfs/inode.c 2006-12-21 01:15:21.000000000 -0700 @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + cancel_dirty_page(page, /* No IO accounting for huge pages? 
*/0); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff -Naupr linux-2.6.19.orig/include/linux/page-flags.h linux-2.6.19/include/linux/page-flags.h --- linux-2.6.19.orig/include/linux/page-flags.h 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/include/linux/page-flags.h 2006-12-21 01:15:21.000000000 -0700 @@ -253,15 +253,11 @@ static inline void SetPageUptodate(struc struct page; /* forward declaration */ -int test_clear_page_dirty(struct page *page); +extern void cancel_dirty_page(struct page *page, unsigned int account_size); + int test_clear_page_writeback(struct page *page); int test_set_page_writeback(struct page *page); -static inline void clear_page_dirty(struct page *page) -{ - test_clear_page_dirty(page); -} - static inline void set_page_writeback(struct page *page) { test_set_page_writeback(page); diff -Naupr linux-2.6.19.orig/mm/memory.c linux-2.6.19/mm/memory.c --- linux-2.6.19.orig/mm/memory.c 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/mm/memory.c 2006-12-21 01:15:21.000000000 -0700 @@ -1832,6 +1832,33 @@ void unmap_mapping_range(struct address_ } EXPORT_SYMBOL(unmap_mapping_range); +static void check_last_page(struct address_space *mapping, loff_t size) +{ + pgoff_t index; + unsigned int offset; + struct page *page; + + if (!mapping) + return; + offset = size & ~PAGE_MASK; + if (!offset) + return; + index = size >> PAGE_SHIFT; + page = find_lock_page(mapping, index); + if (page) { + unsigned int check = 0; + unsigned char *kaddr = kmap_atomic(page, KM_USER0); + do { + check += kaddr[offset++]; + } while (offset < PAGE_SIZE); + kunmap_atomic(kaddr,KM_USER0); + unlock_page(page); + page_cache_release(page); + if (check) + printk("%s: BADNESS: truncate check %u\n", current->comm, check); + } +} + /** * vmtruncate - unmap mappings "freed" by truncate() syscall * @inode: inode of the file used @@ -1865,6 +1892,7 @@ do_expand: goto out_sig; if (offset > inode->i_sb->s_maxbytes) goto out_big; + 
check_last_page(mapping, inode->i_size); i_size_write(inode, offset); out_truncate: diff -Naupr linux-2.6.19.orig/mm/page-writeback.c linux-2.6.19/mm/page-writeback.c --- linux-2.6.19.orig/mm/page-writeback.c 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/mm/page-writeback.c 2006-12-21 01:26:53.000000000 -0700 @@ -843,39 +843,6 @@ int set_page_dirty_lock(struct page *pag EXPORT_SYMBOL(set_page_dirty_lock); /* - * Clear a page's dirty flag, while caring for dirty memory accounting. - * Returns true if the page was previously dirty. - */ -int test_clear_page_dirty(struct page *page) -{ - struct address_space *mapping = page_mapping(page); - unsigned long flags; - - if (mapping) { - write_lock_irqsave(&mapping->tree_lock, flags); - if (TestClearPageDirty(page)) { - radix_tree_tag_clear(&mapping->page_tree, - page_index(page), - PAGECACHE_TAG_DIRTY); - write_unlock_irqrestore(&mapping->tree_lock, flags); - /* - * We can continue to use `mapping' here because the - * page is locked, which pins the address_space - */ - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); - dec_zone_page_state(page, NR_FILE_DIRTY); - } - return 1; - } - write_unlock_irqrestore(&mapping->tree_lock, flags); - return 0; - } - return TestClearPageDirty(page); -} -EXPORT_SYMBOL(test_clear_page_dirty); - -/* * Clear a page's dirty flag, while caring for dirty memory accounting. * Returns true if the page was previously dirty. 
* diff -Naupr linux-2.6.19.orig/mm/truncate.c linux-2.6.19/mm/truncate.c --- linux-2.6.19.orig/mm/truncate.c 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/mm/truncate.c 2006-12-21 15:58:18.000000000 -0700 @@ -50,6 +50,17 @@ static inline void truncate_partial_page do_invalidatepage(page, partial); } +void cancel_dirty_page(struct page *page, unsigned int account_size) +{ + /* If we're cancelling the page, it had better not be mapped any more */+ if (page_mapped(page)) { + static unsigned int warncount; + + WARN_ON(++warncount < 5); + } +} + + /* * If truncate cannot remove the fs-private metadata from the page, the page * becomes anonymous. It will be left on the LRU and may even be mapped into @@ -69,7 +80,8 @@ truncate_complete_page(struct address_sp if (PagePrivate(page)) do_invalidatepage(page, 0); - clear_page_dirty(page); + cancel_dirty_page(page, PAGE_CACHE_SIZE); + ClearPageUptodate(page); ClearPageMappedToDisk(page); remove_from_page_cache(page); @@ -348,7 +360,6 @@ int invalidate_inode_pages2_range(struct for (i = 0; !ret && i < pagevec_count(&pvec); i++) { struct page *page = pvec.pages[i]; pgoff_t page_index; - int was_dirty; lock_page(page); if (page->mapping != mapping) { @@ -384,12 +395,8 @@ int invalidate_inode_pages2_range(struct PAGE_CACHE_SIZE, 0); } } - was_dirty = test_clear_page_dirty(page); - if (!invalidate_complete_page2(mapping, page)) { - if (was_dirty) - set_page_dirty(page); + if (!invalidate_complete_page2(mapping, page)) ret = -EIO; - } unlock_page(page); } pagevec_release(&pvec); -- Gordon Farquharson ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 4:20 ` Gordon Farquharson @ 2006-12-22 4:54 ` Linus Torvalds 2006-12-22 10:00 ` Martin Michlmayr 2006-12-22 15:08 ` Gordon Farquharson 2006-12-22 10:01 ` Martin Michlmayr 1 sibling, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-22 4:54 UTC (permalink / raw) To: Gordon Farquharson Cc: Andrew Morton, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann On Thu, 21 Dec 2006, Gordon Farquharson wrote: > > I tested 2.6.19 with a version of Linus's patch that applies cleanly > to 2.6.19 (patch appended to the end of this email) on ARM and apt-get > failed. It did not segfault this time, but instead got stuck for about > 20 to 30 minutes and was accessing the hard drive frequently. Ok, there's definitely something screwy going on. Andrew located at least one bug: we run cancel_dirty_page() too late in "truncate_complete_page()", which means that do_invalidatepage() ends up not clearing the page cache. His patch is appended. But it sounds like I probably misunderstood something, because I thought that Martin had acknowledged that this patch actually worked for him. Which sounded very similar to your setup (he has a 32M ARM box too, no?) And your failure sounds a lot like one that David Miller is reporting. At the same time, my own shared file mmap tests on my own machines obviously work fine (I lower the dirty-writeback thresholds to force writeback more easily, and then mmap a file and write and rewrite to it in memory, and truncate it). Maybe it's a mount option issue? I've got data=ordered on my machine, are you perhaps running with something else? 
Linus --- commit 3e67c0987d7567ad666641164a153dca9a43b11d Author: Andrew Morton <akpm@osdl.org> Date: Thu Dec 21 11:00:33 2006 -0800 [PATCH] truncate: clear page dirtiness before running try_to_free_buffers() truncate presently invalidates the dirty page's buffer_heads then shoots down the page. But try_to_free_buffers() will now bale out because the page is dirty. Net effect: the LRU gets filled with dirty pages which have invalidated buffer_heads attached. They have no ->mapping and hence cannot be cleaned. The machine leaks memory at an enormous rate. Fix this by cleaning the page before running try_to_free_buffers(), so try_to_free_buffers() can do its work. Also, remember to do dirty-page-accounting in cancel_dirty_page() so the machine won't wedge up trying to write non-existent dirty pages. Probably still wrong, but now less so. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> diff --git a/mm/truncate.c b/mm/truncate.c index bf9e296..89a5c35 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -60,11 +60,12 @@ void cancel_dirty_page(struct page *page, unsigned int account_size) WARN_ON(++warncount < 5); } - if (TestClearPageDirty(page) && account_size) + if (TestClearPageDirty(page) && account_size) { + dec_zone_page_state(page, NR_FILE_DIRTY); task_io_account_cancelled_write(account_size); + } } - /* * If truncate cannot remove the fs-private metadata from the page, the page * becomes anonymous. It will be left on the LRU and may even be mapped into @@ -81,11 +82,11 @@ truncate_complete_page(struct address_space *mapping, struct page *page) if (page->mapping != mapping) return; + cancel_dirty_page(page, PAGE_CACHE_SIZE); + if (PagePrivate(page)) do_invalidatepage(page, 0); - cancel_dirty_page(page, PAGE_CACHE_SIZE); - ClearPageUptodate(page); ClearPageMappedToDisk(page); remove_from_page_cache(page); ^ permalink raw reply related [flat|nested] 311+ messages in thread
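The kind of shared-mmap test mentioned in the message above (map a file, write and rewrite it through the mapping, then truncate it) might look roughly like the following user-space sketch. It is not the test program actually used in the thread; shared_mmap_roundtrip(), TOY_PASSES and the fill pattern are assumptions made for illustration. It also simplifies by unmapping before the truncate, and it reads back through the page cache, so it only exercises the sequence of operations and is not claimed to reproduce the corruption, which was reported under memory pressure with writeback in progress.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define TOY_PASSES 4  /* how many times the mapping is rewritten */

/*
 * Write `len` bytes through a MAP_SHARED mapping several times, sync,
 * truncate the file down to `keep` bytes, then read it back with pread()
 * and compare against the last pattern written.
 * Returns 0 if everything matches, -1 on any error or mismatch.
 */
static int shared_mmap_roundtrip(const char *path, size_t len, size_t keep)
{
    unsigned char *map;
    unsigned char buf[4096];
    struct stat st;
    size_t off = 0;
    int ret = -1;
    int fd;

    assert(len > 0 && keep <= len);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0)
        goto out;

    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;

    /* Dirty every page repeatedly; the last pass decides the contents. */
    for (int pass = 0; pass < TOY_PASSES; pass++)
        for (size_t i = 0; i < len; i++)
            map[i] = (unsigned char)(i + (size_t)pass);

    if (msync(map, len, MS_SYNC) != 0) {
        munmap(map, len);
        goto out;
    }
    if (munmap(map, len) != 0)
        goto out;

    /* Shrink the file and check the reported size. */
    if (ftruncate(fd, (off_t)keep) != 0)
        goto out;
    if (fstat(fd, &st) != 0 || (size_t)st.st_size != keep)
        goto out;

    /* Read the surviving bytes back and compare with the final pattern. */
    while (off < keep) {
        size_t want = keep - off < sizeof(buf) ? keep - off : sizeof(buf);
        ssize_t got = pread(fd, buf, want, (off_t)off);

        if (got <= 0)
            goto out;
        for (ssize_t i = 0; i < got; i++)
            if (buf[i] != (unsigned char)(off + (size_t)i + TOY_PASSES - 1))
                goto out;
        off += (size_t)got;
    }
    ret = 0;
out:
    close(fd);
    unlink(path);
    return ret;
}
```

To get closer to the conditions described in the thread, one would presumably run this on the filesystem under test with a file larger than available memory and with the dirty-writeback thresholds lowered, then re-read the file after the page cache has been dropped.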
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 4:54 ` Linus Torvalds @ 2006-12-22 10:00 ` Martin Michlmayr 2006-12-22 10:06 ` Martin Michlmayr 2006-12-22 10:17 ` Andrew Morton 2006-12-22 15:08 ` Gordon Farquharson 1 sibling, 2 replies; 311+ messages in thread From: Martin Michlmayr @ 2006-12-22 10:00 UTC (permalink / raw) To: Linus Torvalds Cc: Gordon Farquharson, Andrew Morton, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann * Linus Torvalds <torvalds@osdl.org> [2006-12-21 20:54]: > But it sounds like I probably misunderstood something, because I thought > that Martin had acknowledged that this patch actually worked for him. That's what I thought too but now I can confirm what Gordon sees. But it's pretty weird. Our testcase is to run Debian installer on the NSLU2 arm device and apt-get would either segfault or hang at this particular spot in the installation (when apt is first run). With your patch, apt works correctly where it normally fails (at least for me). I stopped the installation at this point and repeated it several more times to make sure it's really working. And, yes, I can repeat this result. This time, however, I let the installer continue and it seems that with your patch apt now works where it failed in the past, but it hangs later on. It's pretty weird because I cannot even kill the process: sh-3.1# ps aux | grep 31126 root 31126 5.7 20.6 16240 6076 ? R+ 04:45 0:21 apt-get -o APT::Status-Fd=4 -o APT::Keep-Fds::=5 -o APT::Keep-Fds::=6 -q -y -f install popularity-contest root 31157 0.0 1.6 1516 492 ttyS0 S+ 04:51 0:00 grep 31126 sh-3.1# kill -9 31126 sh-3.1# kill -9 31126 sh-3.1# ps aux | grep 31126 root 31126 5.6 20.6 16240 6076 ? 
R+ 04:45 0:21 apt-get -o APT::Status-Fd=4 -o APT::Keep-Fds::=5 -o APT::Keep-Fds::=6 -q -y -f install popularity-contest root 31159 0.0 1.6 1516 492 ttyS0 S+ 04:51 0:00 grep 31126 sh-3.1# > Which sounded very similar to your setup (he has a 32M ARM box too, no?) It's the same device, a Linksys NSLU2. > Author: Andrew Morton <akpm@osdl.org> This patch makes it even worse for me. > - if (TestClearPageDirty(page) && account_size) > + if (TestClearPageDirty(page) && account_size) { > + dec_zone_page_state(page, NR_FILE_DIRTY); > task_io_account_cancelled_write(account_size); > + } This hunk (on top of git from about 2 days ago and your latest patch) results in the installer hanging right at the start. The Linux kernel boots fine, the debian-installer is loaded into a ramdisk but when ncurses is being started it just hangs. Reverting this hunk makes it start again. Does that help or confuse you even more? -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 10:00 ` Martin Michlmayr @ 2006-12-22 10:06 ` Martin Michlmayr 2006-12-22 10:10 ` Martin Michlmayr 2006-12-22 10:17 ` Andrew Morton 1 sibling, 1 reply; 311+ messages in thread From: Martin Michlmayr @ 2006-12-22 10:06 UTC (permalink / raw) To: Linus Torvalds Cc: Gordon Farquharson, Andrew Morton, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber * Martin Michlmayr <tbm@cyrius.com> [2006-12-22 11:00]: > This time, however, I let the installer continue and it seems that > with your patch apt now works where it failed in the past, but it > hangs later on. It's pretty weird because I cannot even kill the > process: Okay, it's really weird. So apt-get just hangs doing nothing and I cannot even kill it. I just tried to download strace via wget and immediately when I started wget, the hanging apt-get process continued. -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 10:06 ` Martin Michlmayr @ 2006-12-22 10:10 ` Martin Michlmayr 2006-12-22 11:07 ` Martin Michlmayr 2006-12-22 15:30 ` Gordon Farquharson 0 siblings, 2 replies; 311+ messages in thread From: Martin Michlmayr @ 2006-12-22 10:10 UTC (permalink / raw) To: Linus Torvalds Cc: Gordon Farquharson, Andrew Morton, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber * Martin Michlmayr <tbm@cyrius.com> [2006-12-22 11:06]: > Okay, it's really weird. So apt-get just hangs doing nothing and I > cannot even kill it. I just tried to download strace via wget and > immediately when I started wget, the hanging apt-get process > continued. ... and now that we've completed this step, the apt cache has suddenly been reduced (see Gordon's mail for an explanation) and it segfaults: sh-3.1# ls -l /var/cache/apt/ total 12524 drwxr-xr-x 3 root root 12288 Dec 22 04:41 archives -rw-r--r-- 1 root root 6426885 Dec 22 05:03 pkgcache.bin -rw-r--r-- 1 root root 6426835 Dec 22 05:03 srcpkgcache.bin sh-3.1# apt-get -f install Reading package lists... Done Segmentation faulty tree... 50% -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 10:10 ` Martin Michlmayr @ 2006-12-22 11:07 ` Martin Michlmayr 2006-12-22 15:30 ` Gordon Farquharson 1 sibling, 0 replies; 311+ messages in thread From: Martin Michlmayr @ 2006-12-22 11:07 UTC (permalink / raw) To: Linus Torvalds Cc: Gordon Farquharson, Andrew Morton, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber * Martin Michlmayr <tbm@cyrius.com> [2006-12-22 11:10]: > > immediately when I started wget, the hanging apt-get process > > continued. > ... and now that we've completed this step, the apt cache has suddenly > been reduced (see Gordon's mail for an explanation) and it segfaults: One of my questions was why apt-get worked to install the initramfs-tools, the kernel and some other packages but later hung while it was building the cache (which clearly it had built already to install some packages): before the installer offers to install additional packages, it changes the apt sources, which leads to apt rebuilding the cache, and here it hangs. Remember how I said that downloading a file with wget prompts apt to work again? Apparently any filesystem access will do (I just ran find / > /dev/null). Gordon, can you confirm this? -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 10:10 ` Martin Michlmayr 2006-12-22 11:07 ` Martin Michlmayr @ 2006-12-22 15:30 ` Gordon Farquharson 2006-12-22 17:11 ` Martin Michlmayr 1 sibling, 1 reply; 311+ messages in thread From: Gordon Farquharson @ 2006-12-22 15:30 UTC (permalink / raw) To: Martin Michlmayr Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber On 12/22/06, Martin Michlmayr <tbm@cyrius.com> wrote: > ... and now that we've completed this step, the apt cache has suddenly > been reduced (see Gordon's mail for an explanation) and it segfaults: > > sh-3.1# ls -l /var/cache/apt/ > total 12524 > drwxr-xr-x 3 root root 12288 Dec 22 04:41 archives > -rw-r--r-- 1 root root 6426885 Dec 22 05:03 pkgcache.bin > -rw-r--r-- 1 root root 6426835 Dec 22 05:03 srcpkgcache.bin > sh-3.1# apt-get -f install > Reading package lists... Done > Segmentation faulty tree... 50% I think that we are seeing different manifestations of apt's response to corrupted cache files. There does not appear to be any pattern to which manifestation occurs. Maybe it depends on where in the cache file the corruption is located, i.e. when the corruption occurs. Based on the kernel gurus' current knowledge of the problem, would you expect the corruption to occur at the same point in a file, or is it possible that the corruption could occur at different points on successive Debian installer attempts on a UP, non-PREEMPT system? Gordon -- Gordon Farquharson ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 15:30 ` Gordon Farquharson @ 2006-12-22 17:11 ` Martin Michlmayr 0 siblings, 0 replies; 311+ messages in thread From: Martin Michlmayr @ 2006-12-22 17:11 UTC (permalink / raw) To: Gordon Farquharson Cc: Linus Torvalds, Andrew Morton, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber * Gordon Farquharson <gordonfarquharson@gmail.com> [2006-12-22 08:30]: > Based on the kernel gurus' current knowledge of the problem, would > you expect the corruption to occur at the same point in a file, or > is it possible that the corruption could occur at different points > on successive Debian installer attempts on a UP, non-PREEMPT system? Seems like it can occur anywhere. In fact, some people see apt problems because of filesystem corruption on the NSLU2 after they have already installed Debian. I've only seen this once myself and failed many times to find a reproducible situation. -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 10:00 ` Martin Michlmayr 2006-12-22 10:06 ` Martin Michlmayr @ 2006-12-22 10:17 ` Andrew Morton 2006-12-22 11:12 ` Martin Michlmayr 2006-12-22 12:24 ` Andrei Popa 1 sibling, 2 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-22 10:17 UTC (permalink / raw) To: Martin Michlmayr Cc: Linus Torvalds, Gordon Farquharson, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann On Fri, 22 Dec 2006 11:00:04 +0100 Martin Michlmayr <tbm@cyrius.com> wrote: > > - if (TestClearPageDirty(page) && account_size) > > + if (TestClearPageDirty(page) && account_size) { > > + dec_zone_page_state(page, NR_FILE_DIRTY); > > task_io_account_cancelled_write(account_size); > > + } > > This hunk (on top of git from about 2 days ago and your latest patch) > results in the installer hanging right at the start. You'll need this also: From: Andrew Morton <akpm@osdl.org> Only (un)account for IO and page-dirtying for devices which have real backing store (ie: not tmpfs or ramdisks). Cc: "David S. Miller" <davem@davemloft.net> Cc: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> --- mm/truncate.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff -puN mm/truncate.c~truncate-dirty-memory-accounting-fix mm/truncate.c --- a/mm/truncate.c~truncate-dirty-memory-accounting-fix +++ a/mm/truncate.c @@ -60,7 +60,8 @@ void cancel_dirty_page(struct page *page WARN_ON(++warncount < 5); } - if (TestClearPageDirty(page) && account_size) { + if (TestClearPageDirty(page) && account_size && + mapping_cap_account_dirty(page->mapping)) { dec_zone_page_state(page, NR_FILE_DIRTY); task_io_account_cancelled_write(account_size); } _ ^ permalink raw reply [flat|nested] 311+ messages in thread
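The accounting fix in the message above (only un-account a cancelled dirty page when its mapping takes part in dirty accounting at all) can be illustrated with a user-space toy model. This is a sketch under stated assumptions, not kernel code: struct toy_mapping, struct toy_page, nr_file_dirty, toy_set_dirty() and toy_cancel_dirty() are hypothetical stand-ins, and the `check_mapping` flag merely switches between the behaviour before and after the added mapping_cap_account_dirty() test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy mapping: does this backing store take part in dirty accounting? */
struct toy_mapping {
    bool accounts_dirty;  /* false models tmpfs or a ramdisk */
};

/* Toy page; NOT the kernel's struct page. */
struct toy_page {
    struct toy_mapping *mapping;
    bool dirty;
};

/* Toy stand-in for the NR_FILE_DIRTY zone counter. */
static long nr_file_dirty;

/* Dirtying a page only bumps the counter for accounted mappings. */
static void toy_set_dirty(struct toy_page *p)
{
    assert(p != NULL && p->mapping != NULL);
    if (p->dirty)
        return;
    p->dirty = true;
    if (p->mapping->accounts_dirty)
        nr_file_dirty++;
}

/*
 * Cancel a page's dirty state.  With check_mapping == false the counter
 * is decremented unconditionally (the earlier hunk); with
 * check_mapping == true the decrement is skipped for mappings that never
 * incremented it (the follow-up fix).
 */
static void toy_cancel_dirty(struct toy_page *p, bool check_mapping)
{
    assert(p != NULL && p->mapping != NULL);
    if (!p->dirty)
        return;
    p->dirty = false;
    if (!check_mapping || p->mapping->accounts_dirty)
        nr_file_dirty--;
}
```

In this model a page on an unaccounted mapping drives the counter to -1 when the check is missing, which is the kind of imbalance the follow-up patch avoids for ramdisk-backed pages.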
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 10:17 ` Andrew Morton @ 2006-12-22 11:12 ` Martin Michlmayr 2006-12-22 12:24 ` Andrei Popa 1 sibling, 0 replies; 311+ messages in thread From: Martin Michlmayr @ 2006-12-22 11:12 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Gordon Farquharson, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List * Andrew Morton <akpm@osdl.org> [2006-12-22 02:17]: > > This hunk (on top of git from about 2 days ago and your latest patch) > > results in the installer hanging right at the start. > > You'll need this also: It starts again, thanks. -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 10:17 ` Andrew Morton 2006-12-22 11:12 ` Martin Michlmayr @ 2006-12-22 12:24 ` Andrei Popa 2006-12-22 12:32 ` Martin Michlmayr 1 sibling, 1 reply; 311+ messages in thread From: Andrei Popa @ 2006-12-22 12:24 UTC (permalink / raw) To: Andrew Morton Cc: Martin Michlmayr, Linus Torvalds, Gordon Farquharson, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann With all three patches I have corruption.... diff --git a/fs/buffer.c b/fs/buffer.c index d1f1b54..263f88e 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? */ @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* - * If the filesystem writes its buffers by hand (eg ext3) - * then we can have clean buffers against a dirty page. We - * clean the page here; otherwise later reattachment of buffers - * could encounter a non-uptodate page, which is unresolvable. - * This only applies in the rare case where try_to_free_buffers - * succeeds but the page is not freed. - * - * Also, during truncate, discard_buffer will have marked all - * the page's buffers clean. We discover that here and clean - * the page also. 
- */ - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index ed2c223..4f4cd13 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + cancel_dirty_page(page, /* No IO accounting for huge pages? */0); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 9d774d0..8879f1d 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -61,31 +61,6 @@ ({ \ }) #endif -#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY -#define ptep_test_and_clear_dirty(__vma, __address, __ptep) \ -({ \ - pte_t __pte = *__ptep; \ - int r = 1; \ - if (!pte_dirty(__pte)) \ - r = 0; \ - else \ - set_pte_at((__vma)->vm_mm, (__address), (__ptep), \ - pte_mkclean(__pte)); \ - r; \ -}) -#endif - -#ifndef __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH -#define ptep_clear_flush_dirty(__vma, __address, __ptep) \ -({ \ - int __dirty; \ - __dirty = ptep_test_and_clear_dirty(__vma, __address, __ptep); \ - if (__dirty) \ - flush_tlb_page(__vma, __address); \ - __dirty; \ -}) -#endif - #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR #define ptep_get_and_clear(__mm, __address, __ptep) \ ({ \ diff --git a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h index e6a4723..b61d6f9 100644 --- a/include/asm-i386/pgtable.h +++ b/include/asm-i386/pgtable.h @@ -300,18 +300,20 @@ do { \ flush_tlb_page(vma, address); \ } while (0) -#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH -#define ptep_clear_flush_dirty(vma, address, ptep) \ -({ \ - int __dirty; \ - __dirty = pte_dirty(*(ptep)); \ - if (__dirty) { \ - clear_bit(_PAGE_BIT_DIRTY, &(ptep)->pte_low); \ - pte_update_defer((vma)->vm_mm, (address), (ptep)); \ - 
flush_tlb_page(vma, address); \ - } \ - __dirty; \ -}) +/* + * "ptep_exchange()" can be used to atomically change a set of + * page table protection bits, returning the old ones (the dirty + * and accessed bits in particular, since they are set by hw). + * + * "ptep_flush_dirty()" then returns the dirty status of the + * page (on x86-64, we just look at the dirty bit in the returned + * pte, but some other architectures have the dirty bits in + * other places than the page tables). + */ +#define ptep_exchange(vma, address, ptep, old, new) \ + (old).pte_low = xchg(&(ptep)->pte_low, (new).pte_low); +#define ptep_flush_dirty(vma, address, ptep, old) \ + pte_dirty(old) #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH #define ptep_clear_flush_young(vma, address, ptep) \ diff --git a/include/asm-x86_64/pgtable.h b/include/asm-x86_64/pgtable.h index 59901c6..07754b5 100644 --- a/include/asm-x86_64/pgtable.h +++ b/include/asm-x86_64/pgtable.h @@ -283,12 +283,20 @@ static inline pte_t pte_clrhuge(pte_t pt struct vm_area_struct; -static inline int ptep_test_and_clear_dirty(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep) -{ - if (!pte_dirty(*ptep)) - return 0; - return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte); -} +/* + * "ptep_exchange()" can be used to atomically change a set of + * page table protection bits, returning the old ones (the dirty + * and accessed bits in particular, since they are set by hw). + * + * "ptep_flush_dirty()" then returns the dirty status of the + * page (on x86-64, we just look at the dirty bit in the returned + * pte, but some other architectures have the dirty bits in + * other places than the page tables). 
+ */ +#define ptep_exchange(vma, address, ptep, old, new) \ + (old).pte = xchg(&(ptep)->pte, (new).pte); +#define ptep_flush_dirty(vma, address, ptep, old) \ + pte_dirty(old) static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep) { diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 4830a3b..350878a 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -253,15 +253,11 @@ #define ClearPageUncached(page) clear_bi struct page; /* forward declaration */ -int test_clear_page_dirty(struct page *page); +extern void cancel_dirty_page(struct page *page, unsigned int account_size); + int test_clear_page_writeback(struct page *page); int test_set_page_writeback(struct page *page); -static inline void clear_page_dirty(struct page *page) -{ - test_clear_page_dirty(page); -} - static inline void set_page_writeback(struct page *page) { test_set_page_writeback(page); diff --git a/mm/memory.c b/mm/memory.c index c00bac6..79cecab 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_ } EXPORT_SYMBOL(unmap_mapping_range); +static void check_last_page(struct address_space *mapping, loff_t size) +{ + pgoff_t index; + unsigned int offset; + struct page *page; + + if (!mapping) + return; + offset = size & ~PAGE_MASK; + if (!offset) + return; + index = size >> PAGE_SHIFT; + page = find_lock_page(mapping, index); + if (page) { + unsigned int check = 0; + unsigned char *kaddr = kmap_atomic(page, KM_USER0); + do { + check += kaddr[offset++]; + } while (offset < PAGE_SIZE); + kunmap_atomic(kaddr,KM_USER0); + unlock_page(page); + page_cache_release(page); + if (check) + printk("%s: BADNESS: truncate check %u\n", current->comm, check); + } +} + /** * vmtruncate - unmap mappings "freed" by truncate() syscall * @inode: inode of the file used @@ -1875,6 +1902,7 @@ do_expand: goto out_sig; if (offset > inode->i_sb->s_maxbytes) goto out_big; + 
check_last_page(mapping, inode->i_size); i_size_write(inode, offset); out_truncate: diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 237107c..b3a198c 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -845,38 +845,6 @@ int set_page_dirty_lock(struct page *pag EXPORT_SYMBOL(set_page_dirty_lock); /* - * Clear a page's dirty flag, while caring for dirty memory accounting. - * Returns true if the page was previously dirty. - */ -int test_clear_page_dirty(struct page *page) -{ - struct address_space *mapping = page_mapping(page); - unsigned long flags; - - if (!mapping) - return TestClearPageDirty(page); - - write_lock_irqsave(&mapping->tree_lock, flags); - if (TestClearPageDirty(page)) { - radix_tree_tag_clear(&mapping->page_tree, - page_index(page), PAGECACHE_TAG_DIRTY); - write_unlock_irqrestore(&mapping->tree_lock, flags); - /* - * We can continue to use `mapping' here because the - * page is locked, which pins the address_space - */ - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); - dec_zone_page_state(page, NR_FILE_DIRTY); - } - return 1; - } - write_unlock_irqrestore(&mapping->tree_lock, flags); - return 0; -} -EXPORT_SYMBOL(test_clear_page_dirty); - -/* * Clear a page's dirty flag, while caring for dirty memory accounting. * Returns true if the page was previously dirty. 
* diff --git a/mm/rmap.c b/mm/rmap.c index d8a842a..a028803 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page { struct mm_struct *mm = vma->vm_mm; unsigned long address; - pte_t *pte, entry; + pte_t *ptep; spinlock_t *ptl; int ret = 0; @@ -440,22 +440,24 @@ static int page_mkclean_one(struct page if (address == -EFAULT) goto out; - pte = page_check_address(page, mm, address, &ptl); - if (!pte) - goto out; - - if (!pte_dirty(*pte) && !pte_write(*pte)) - goto unlock; - - entry = ptep_get_and_clear(mm, address, pte); - entry = pte_mkclean(entry); - entry = pte_wrprotect(entry); - ptep_establish(vma, address, pte, entry); - lazy_mmu_prot_update(entry); - ret = 1; - -unlock: - pte_unmap_unlock(pte, ptl); + ptep = page_check_address(page, mm, address, &ptl); + if (ptep) { + pte_t old, new; + + old = *ptep; + new = pte_wrprotect(pte_mkclean(old)); + if (!pte_same(old, new)) { + for (;;) { + flush_cache_page(vma, address, page_to_pfn(page)); + ptep_exchange(vma, address, ptep, old, new); + if (pte_same(old, new)) + break; + ret |= ptep_flush_dirty(vma, address, ptep, old); + flush_tlb_page(vma, address); + } + } + pte_unmap_unlock(pte, ptl); + } out: return ret; } diff --git a/mm/truncate.c b/mm/truncate.c index 9bfb8e8..4a38dd1 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -51,6 +51,22 @@ static inline void truncate_partial_page do_invalidatepage(page, partial); } +void cancel_dirty_page(struct page *page, unsigned int account_size) +{ + /* If we're cancelling the page, it had better not be mapped any more */ + if (page_mapped(page)) { + static unsigned int warncount; + + WARN_ON(++warncount < 5); + } + + if (TestClearPageDirty(page) && account_size && + mapping_cap_account_dirty(page->mapping)) { + dec_zone_page_state(page, NR_FILE_DIRTY); + task_io_account_cancelled_write(account_size); + } +} + /* * If truncate cannot remove the fs-private metadata from the page, the page * becomes anonymous. 
It will be left on the LRU and may even be mapped into @@ -67,11 +83,11 @@ truncate_complete_page(struct address_sp if (page->mapping != mapping) return; + cancel_dirty_page(page, PAGE_CACHE_SIZE); + if (PagePrivate(page)) do_invalidatepage(page, 0); - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); ClearPageUptodate(page); ClearPageMappedToDisk(page); remove_from_page_cache(page); @@ -350,7 +366,6 @@ int invalidate_inode_pages2_range(struct for (i = 0; !ret && i < pagevec_count(&pvec); i++) { struct page *page = pvec.pages[i]; pgoff_t page_index; - int was_dirty; lock_page(page); if (page->mapping != mapping) { @@ -386,12 +401,8 @@ int invalidate_inode_pages2_range(struct PAGE_CACHE_SIZE, 0); } } - was_dirty = test_clear_page_dirty(page); - if (!invalidate_complete_page2(mapping, page)) { - if (was_dirty) - set_page_dirty(page); + if (!invalidate_complete_page2(mapping, page)) ret = -EIO; - } unlock_page(page); } pagevec_release(&pvec); On Fri, 2006-12-22 at 02:17 -0800, Andrew Morton wrote: > On Fri, 22 Dec 2006 11:00:04 +0100 > Martin Michlmayr <tbm@cyrius.com> wrote: > > > > - if (TestClearPageDirty(page) && account_size) > > > + if (TestClearPageDirty(page) && account_size) { > > > + dec_zone_page_state(page, NR_FILE_DIRTY); > > > task_io_account_cancelled_write(account_size); > > > + } > > > > This hunk (on top of git from about 2 days ago and your latest patch) > > results in the installer hanging right at the start. > > You'll need this also: > > From: Andrew Morton <akpm@osdl.org> > > Only (un)account for IO and page-dirtying for devices which have real backing > store (ie: not tmpfs or ramdisks). > > Cc: "David S. 
Miller" <davem@davemloft.net> > Cc: Linus Torvalds <torvalds@osdl.org> > Signed-off-by: Andrew Morton <akpm@osdl.org> > --- > > mm/truncate.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff -puN mm/truncate.c~truncate-dirty-memory-accounting-fix mm/truncate.c > --- a/mm/truncate.c~truncate-dirty-memory-accounting-fix > +++ a/mm/truncate.c > @@ -60,7 +60,8 @@ void cancel_dirty_page(struct page *page > WARN_ON(++warncount < 5); > } > > - if (TestClearPageDirty(page) && account_size) { > + if (TestClearPageDirty(page) && account_size && > + mapping_cap_account_dirty(page->mapping)) { > dec_zone_page_state(page, NR_FILE_DIRTY); > task_io_account_cancelled_write(account_size); > } > _ > ^ permalink raw reply related [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 12:24 ` Andrei Popa @ 2006-12-22 12:32 ` Martin Michlmayr 2006-12-22 12:59 ` Martin Michlmayr ` (2 more replies) 0 siblings, 3 replies; 311+ messages in thread From: Martin Michlmayr @ 2006-12-22 12:32 UTC (permalink / raw) To: Andrei Popa Cc: Andrew Morton, Linus Torvalds, Gordon Farquharson, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List * Andrei Popa <andrei.popa@i-neo.ro> [2006-12-22 14:24]: > With all three patches I have corruption.... I've completed one installation with Linus' patch plus the two from Andrew successfully, but I'm currently trying again... but I really need a better testcase since an installation takes about an hour. Andrei, which torrent do you download as a testcase? It would be good if someone could suggest a torrent which is legal and not too large. -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 12:32 ` Martin Michlmayr @ 2006-12-22 12:59 ` Martin Michlmayr 2006-12-22 13:25 ` Peter Zijlstra 2006-12-22 15:01 ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Patrick Mau 2006-12-23 8:15 ` Andrei Popa 2 siblings, 1 reply; 311+ messages in thread From: Martin Michlmayr @ 2006-12-22 12:59 UTC (permalink / raw) To: Andrei Popa Cc: Andrew Morton, Linus Torvalds, Gordon Farquharson, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List * Martin Michlmayr <tbm@cyrius.com> [2006-12-22 13:32]: > I've completed one installation with Linus' patch plus the two from > Andrew successfully, but I'm currently trying again... ... and it failed. -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 12:59 ` Martin Michlmayr @ 2006-12-22 13:25 ` Peter Zijlstra 2006-12-22 13:29 ` Peter Zijlstra ` (2 more replies) 0 siblings, 3 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-22 13:25 UTC (permalink / raw) To: Martin Michlmayr Cc: Andrei Popa, Andrew Morton, Linus Torvalds, Gordon Farquharson, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Fri, 2006-12-22 at 13:59 +0100, Martin Michlmayr wrote: > * Martin Michlmayr <tbm@cyrius.com> [2006-12-22 13:32]: > > I've completed one installation with Linus' patch plus the two from > > Andrew successfully, but I'm currently trying again... > > .... and it failed. Since you are on ARM you might want to try with the page_mkclean_one cleanup patch too. Arjan agreed that the loop is not needed; we clear the pte, flush on all CPUs and then re-establish the pte. Any race will fault and be serialised on the pte lock. FWIW - with today's -git and Andrew's second cancel_dirty_page() patch: http://lkml.org/lkml/2006/12/22/49 I am unable to trigger any corruption - I could again earlier by raising the number of seeds from 3 to 6. (am currently at 10 seeds) From: Peter Zijlstra <a.p.zijlstra@chello.nl> fix page_mkclean_one() - add flush_cache_page() for all those virtual indexed cache architectures. - handle s390.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- mm/rmap.c | 38 +++++++++++++++++++++++++------------- 1 file changed, 25 insertions(+), 13 deletions(-) Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page { struct mm_struct *mm = vma->vm_mm; unsigned long address; - pte_t *pte, entry; + pte_t *pte; spinlock_t *ptl; int ret = 0; @@ -444,17 +444,18 @@ static int page_mkclean_one(struct page if (!pte) goto out; - if (!pte_dirty(*pte) && !pte_write(*pte)) - goto unlock; + if (pte_dirty(*pte) || pte_write(*pte)) { + pte_t entry; - entry = ptep_get_and_clear(mm, address, pte); - entry = pte_mkclean(entry); - entry = pte_wrprotect(entry); - ptep_establish(vma, address, pte, entry); - lazy_mmu_prot_update(entry); - ret = 1; + flush_cache_page(vma, address, pte_pfn(*pte)); + entry = ptep_clear_flush(vma, address, pte); + entry = pte_wrprotect(entry); + entry = pte_mkclean(entry); + set_pte_at(vma, address, pte, entry); + lazy_mmu_prot_update(entry); + ret = 1; + } -unlock: pte_unmap_unlock(pte, ptl); out: return ret; @@ -489,6 +490,8 @@ int page_mkclean(struct page *page) if (mapping) ret = page_mkclean_file(mapping, page); } + if (page_test_and_clear_dirty(page)) + ret = 1; return ret; } ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 13:25 ` Peter Zijlstra @ 2006-12-22 13:29 ` Peter Zijlstra 2006-12-22 17:56 ` Linus Torvalds 2006-12-22 19:20 ` Martin Michlmayr 2 siblings, 0 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-22 13:29 UTC (permalink / raw) To: Martin Michlmayr Cc: Andrei Popa, Andrew Morton, Linus Torvalds, Gordon Farquharson, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List A cleanup of try_to_unmap. I have not identified any races that this would solve, but I did it for consistency's sake. Also includes a small s390 optimization by moving page_test_and_clear_dirty() out of the vma iteration. From: Peter Zijlstra <a.p.zijlstra@chello.nl> We clear the page in the following sequence: ClearPageDirty - lock ptl, clear pte, unlock ptl hence we should dirty in the opposite order: lock ptl, clear pte, unlock ptl - SetPageDirty try_to_unmap_one violates this by doing the SetPageDirty under the ptl. Also move page_test_and_clear_dirty() to try_to_unmap(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- mm/rmap.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) Index: linux-2.6/mm/rmap.c =================================================================== --- linux-2.6.orig/mm/rmap.c +++ linux-2.6/mm/rmap.c @@ -590,8 +590,6 @@ void page_remove_rmap(struct page *page) * Leaving it set also helps swapoff to reinstate ptes * faster for those pages still in swapcache. */ - if (page_test_and_clear_dirty(page)) - set_page_dirty(page); __dec_zone_page_state(page, PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED); } @@ -610,6 +608,7 @@ static int try_to_unmap_one(struct page pte_t pteval; spinlock_t *ptl; int ret = SWAP_AGAIN; + struct page *dirty_page = NULL; address = vma_address(page, vma); if (address == -EFAULT) @@ -636,7 +635,7 @@ static int try_to_unmap_one(struct page /* Move the dirty bit to the physical page now the pte is gone.
*/ if (pte_dirty(pteval)) - set_page_dirty(page); + dirty_page = page; /* Update high watermark before we lower rss */ update_hiwater_rss(mm); @@ -687,6 +686,8 @@ static int try_to_unmap_one(struct page out_unmap: pte_unmap_unlock(pte, ptl); + if (dirty_page) + set_page_dirty(dirty_page); out: return ret; } @@ -918,6 +919,9 @@ int try_to_unmap(struct page *page, int else ret = try_to_unmap_file(page, migration); + if (page_test_and_clear_dirty(page)) + set_page_dirty(page); + if (!page_mapped(page)) ret = SWAP_SUCCESS; return ret; ^ permalink raw reply [flat|nested] 311+ messages in thread
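The ordering the patch above aims for (note the dirty state while the page table lock is held, but call set_page_dirty() only after dropping it) can be sketched in plain userspace C. This is a toy model under stated assumptions: ptl_held is a simple flag standing in for the pte lock, and every toy_* name is hypothetical rather than a kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; a plain flag models the page table lock (ptl). */
struct toy_page { bool dirty; };

static bool ptl_held;
static int dirty_calls_under_lock;   /* counts ordering violations */

/* Models set_page_dirty(); records whether it ran with the lock held. */
static void toy_set_page_dirty(struct toy_page *page)
{
	if (ptl_held)
		dirty_calls_under_lock++;
	page->dirty = true;
}

/* Models the reworked unmap path: remember under the lock, dirty after it. */
static void toy_unmap_one(struct toy_page *page, bool pte_was_dirty)
{
	struct toy_page *dirty_page = NULL;

	assert(!ptl_held);
	ptl_held = true;                 /* lock ptl */
	/* ... the pte would be cleared here ... */
	if (pte_was_dirty)
		dirty_page = page;           /* only remember, do not dirty yet */
	ptl_held = false;                /* unlock ptl */

	if (dirty_page)
		toy_set_page_dirty(dirty_page);
}
```

Calling toy_unmap_one() with pte_was_dirty set leaves the page dirty while dirty_calls_under_lock stays at zero, which is the property the deferred call is meant to preserve.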
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 13:25 ` Peter Zijlstra 2006-12-22 13:29 ` Peter Zijlstra @ 2006-12-22 17:56 ` Linus Torvalds 2006-12-22 19:20 ` Martin Michlmayr 2 siblings, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-22 17:56 UTC (permalink / raw) To: Peter Zijlstra Cc: Martin Michlmayr, Andrei Popa, Andrew Morton, Gordon Farquharson, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Fri, 22 Dec 2006, Peter Zijlstra wrote: > > fix page_mkclean_one() > > - add flush_cache_page() for all those virtual indexed cache > architectures. I think the flush_cache_page() should be after we've actually flushed it from the TLB and re-inserted it (this is one reason why I did the "ptep_exchange()" version of this). Otherwise somebody can still write to the page _after_ the cache flush.. > - handle s390. Yeah, that looks like the proper way to handle that. That said, it looks like we still see corruption. You may not, but Martin and Andrei still report problems, even with all the patches (including the last one from Andrew that avoids "dirty" going negative under some circumstances, and explains the "slow and/or never completed" case that Gordon and Martin saw). The good news is that I think the code now is cleaner and more understandable. The bad news is that nothing we've ever tried seems to have fixed the _problem_. And I don't think it's page_mkclean(). Especially not since the ARM people are seeing this under UP without PREEMPT. In that kind of scenario, the only possible races tend to be from things that actually block: "set_page_dirty()" (which blocks on IO in balancing), memory allocations, and obviously doing actual IO. And it's not a virtual cache problem, since others see it on x86.
Of course, since it's quite possibly two different issues, maybe the virtual cache flush is required in order to force write-back to memory (which in turn is required for the DMA for the actual write!). So the ARM issue certainly could be due to the flush_cache_page() thing... Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 13:25 ` Peter Zijlstra 2006-12-22 13:29 ` Peter Zijlstra 2006-12-22 17:56 ` Linus Torvalds @ 2006-12-22 19:20 ` Martin Michlmayr 2006-12-24 8:10 ` Gordon Farquharson 2 siblings, 1 reply; 311+ messages in thread From: Martin Michlmayr @ 2006-12-22 19:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrei Popa, Andrew Morton, Linus Torvalds, Gordon Farquharson, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List [-- Attachment #1: Type: text/plain, Size: 399 bytes --] * Peter Zijlstra <a.p.zijlstra@chello.nl> [2006-12-22 14:25]: > > .... and it failed. > Since you are on ARM you might want to try with the page_mkclean_one > cleanup patch too. I've already tried it and it didn't work. I just tried it again together with Linus' patch and the two from Andrew and it still fails. (For reference, the patch is attached.) -- Martin Michlmayr http://www.cyrius.com/ [-- Attachment #2: p --] [-- Type: text/plain, Size: 7798 bytes --] --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? */ @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* - * If the filesystem writes its buffers by hand (eg ext3) - * then we can have clean buffers against a dirty page. We - * clean the page here; otherwise later reattachment of buffers - * could encounter a non-uptodate page, which is unresolvable. - * This only applies in the rare case where try_to_free_buffers - * succeeds but the page is not freed. - * - * Also, during truncate, discard_buffer will have marked all - * the page's buffers clean. 
We discover that here and clean - * the page also. - */ - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index ed2c223..4f4cd13 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + cancel_dirty_page(page, /* No IO accounting for huge pages? */0); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 4830a3b..350878a 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -253,15 +253,11 @@ #define ClearPageUncached(page) clear_bi struct page; /* forward declaration */ -int test_clear_page_dirty(struct page *page); +extern void cancel_dirty_page(struct page *page, unsigned int account_size); + int test_clear_page_writeback(struct page *page); int test_set_page_writeback(struct page *page); -static inline void clear_page_dirty(struct page *page) -{ - test_clear_page_dirty(page); -} - static inline void set_page_writeback(struct page *page) { test_set_page_writeback(page); diff --git a/mm/memory.c b/mm/memory.c index c00bac6..79cecab 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_ } EXPORT_SYMBOL(unmap_mapping_range); +static void check_last_page(struct address_space *mapping, loff_t size) +{ + pgoff_t index; + unsigned int offset; + struct page *page; + + if (!mapping) + return; + offset = size & ~PAGE_MASK; + if (!offset) + return; + index = size >> PAGE_SHIFT; + page = find_lock_page(mapping, index); + if (page) { + unsigned int check = 0; + unsigned char *kaddr = kmap_atomic(page, KM_USER0); + do { + check += kaddr[offset++]; + } while (offset < PAGE_SIZE); + 
kunmap_atomic(kaddr,KM_USER0); + unlock_page(page); + page_cache_release(page); + if (check) + printk("%s: BADNESS: truncate check %u\n", current->comm, check); + } +} + /** * vmtruncate - unmap mappings "freed" by truncate() syscall * @inode: inode of the file used @@ -1875,6 +1902,7 @@ do_expand: goto out_sig; if (offset > inode->i_sb->s_maxbytes) goto out_big; + check_last_page(mapping, inode->i_size); i_size_write(inode, offset); out_truncate: diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 237107c..b3a198c 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -845,38 +845,6 @@ int set_page_dirty_lock(struct page *pag EXPORT_SYMBOL(set_page_dirty_lock); /* - * Clear a page's dirty flag, while caring for dirty memory accounting. - * Returns true if the page was previously dirty. - */ -int test_clear_page_dirty(struct page *page) -{ - struct address_space *mapping = page_mapping(page); - unsigned long flags; - - if (!mapping) - return TestClearPageDirty(page); - - write_lock_irqsave(&mapping->tree_lock, flags); - if (TestClearPageDirty(page)) { - radix_tree_tag_clear(&mapping->page_tree, - page_index(page), PAGECACHE_TAG_DIRTY); - write_unlock_irqrestore(&mapping->tree_lock, flags); - /* - * We can continue to use `mapping' here because the - * page is locked, which pins the address_space - */ - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); - dec_zone_page_state(page, NR_FILE_DIRTY); - } - return 1; - } - write_unlock_irqrestore(&mapping->tree_lock, flags); - return 0; -} -EXPORT_SYMBOL(test_clear_page_dirty); - -/* * Clear a page's dirty flag, while caring for dirty memory accounting. * Returns true if the page was previously dirty. 
* diff --git a/mm/rmap.c b/mm/rmap.c index d8a842a..3278b2a 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page { struct mm_struct *mm = vma->vm_mm; unsigned long address; - pte_t *pte, entry; + pte_t *pte; spinlock_t *ptl; int ret = 0; @@ -444,17 +444,18 @@ static int page_mkclean_one(struct page if (!pte) goto out; - if (!pte_dirty(*pte) && !pte_write(*pte)) - goto unlock; + if (pte_dirty(*pte) || pte_write(*pte)) { + pte_t entry; - entry = ptep_get_and_clear(mm, address, pte); - entry = pte_mkclean(entry); - entry = pte_wrprotect(entry); - ptep_establish(vma, address, pte, entry); - lazy_mmu_prot_update(entry); - ret = 1; + flush_cache_page(vma, address, pte_pfn(*pte)); + entry = ptep_clear_flush(vma, address, pte); + entry = pte_wrprotect(entry); + entry = pte_mkclean(entry); + set_pte_at(vma, address, pte, entry); + lazy_mmu_prot_update(entry); + ret = 1; + } -unlock: pte_unmap_unlock(pte, ptl); out: return ret; @@ -489,6 +490,8 @@ int page_mkclean(struct page *page) if (mapping) ret = page_mkclean_file(mapping, page); } + if (page_test_and_clear_dirty(page)) + ret = 1; return ret; } diff --git a/mm/truncate.c b/mm/truncate.c index 9bfb8e8..4a38dd1 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -51,6 +51,22 @@ static inline void truncate_partial_page do_invalidatepage(page, partial); } +void cancel_dirty_page(struct page *page, unsigned int account_size) +{ + /* If we're cancelling the page, it had better not be mapped any more */ + if (page_mapped(page)) { + static unsigned int warncount; + + WARN_ON(++warncount < 5); + } + + if (TestClearPageDirty(page) && account_size && + mapping_cap_account_dirty(page->mapping)) { + dec_zone_page_state(page, NR_FILE_DIRTY); + task_io_account_cancelled_write(account_size); + } +} + /* * If truncate cannot remove the fs-private metadata from the page, the page * becomes anonymous. 
It will be left on the LRU and may even be mapped into @@ -67,11 +83,11 @@ truncate_complete_page(struct address_sp if (page->mapping != mapping) return; + cancel_dirty_page(page, PAGE_CACHE_SIZE); + if (PagePrivate(page)) do_invalidatepage(page, 0); - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); ClearPageUptodate(page); ClearPageMappedToDisk(page); remove_from_page_cache(page); @@ -350,7 +366,6 @@ int invalidate_inode_pages2_range(struct for (i = 0; !ret && i < pagevec_count(&pvec); i++) { struct page *page = pvec.pages[i]; pgoff_t page_index; - int was_dirty; lock_page(page); if (page->mapping != mapping) { @@ -386,12 +401,8 @@ int invalidate_inode_pages2_range(struct PAGE_CACHE_SIZE, 0); } } - was_dirty = test_clear_page_dirty(page); - if (!invalidate_complete_page2(mapping, page)) { - if (was_dirty) - set_page_dirty(page); + if (!invalidate_complete_page2(mapping, page)) ret = -EIO; - } unlock_page(page); } pagevec_release(&pvec); ^ permalink raw reply related [flat|nested] 311+ messages in thread
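The check_last_page() debugging hook in the patch above sums the bytes of the last page that lie beyond the old end of file and reports a nonzero sum as "BADNESS". A minimal userspace sketch of the same arithmetic follows, assuming a 4096-byte page and operating on an ordinary buffer; TOY_PAGE_SIZE and toy_check_last_page() are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u   /* assumed page size for this sketch */

/*
 * Return the byte sum of the page tail past end-of-file.
 * `page` holds the contents of the file's last page and `size` is the
 * file size in bytes. A result of 0 means the tail is all zeroes (or
 * the file ends exactly on a page boundary and has no tail).
 */
static unsigned int toy_check_last_page(const unsigned char *page,
					unsigned long long size)
{
	/* Same value as "size & ~PAGE_MASK" for a power-of-two page size. */
	unsigned int offset = (unsigned int)(size % TOY_PAGE_SIZE);
	unsigned int check = 0;

	assert(page != NULL);
	if (offset == 0)
		return 0;

	while (offset < TOY_PAGE_SIZE)
		check += page[offset++];
	return check;
}
```

A nonzero return value corresponds to the condition under which the kernel patch prints its warning. Like the original, the byte sum is only a cheap indicator and says nothing about where in the tail the stray data sits.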
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 19:20 ` Martin Michlmayr @ 2006-12-24 8:10 ` Gordon Farquharson 2006-12-24 8:43 ` Linus Torvalds 0 siblings, 1 reply; 311+ messages in thread From: Gordon Farquharson @ 2006-12-24 8:10 UTC (permalink / raw) To: Martin Michlmayr Cc: Peter Zijlstra, Andrei Popa, Andrew Morton, Linus Torvalds, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On 12/22/06, Martin Michlmayr <tbm@cyrius.com> wrote: > * Peter Zijlstra <a.p.zijlstra@chello.nl> [2006-12-22 14:25]: > > > .... and it failed. > > Since you are on ARM you might want to try with the page_mkclean_one > > cleanup patch too. > > I've already tried it and it didn't work. I just tried it again > together with Linus' patch and the two from Andrew and it still fails. > (For reference, the patch is attached.) I can confirm this behaviour with 2.6.19 and the patches mentioned above (cumulative patch for 2.6.19 appended to the end of this email). Is there any way to provide any debugging information that may help solve the problem ? Would it help to know the nature of the corruption e.g. an analysis of the corruption in the file ? I have previously asked apt developers if they wanted to look at the corrupted cache files, but there were no takers then. BTW, I decided to try Linus's test program [1] on ARM (I don't think that anybody had tried it on ARM before). 
Since we see file corruption with 2.6.18 + [PATCH] mm: tracking shared dirty pages [2], I ran Linus's program on machines with the following setups: 2.6.18 + the following patches mm: tracking shared dirty pages [2] mm: balance dirty pages [3] mm: optimize the new mprotect() code a bit [4] mm: small cleanup of install_page() [5] mm: fixup do_wp_page() [6] mm: msync() cleanup [7] $ ./mm-test | od -x 0000000 aaaa aaaa aaaa aaaa aaaa 0000 0000 0000 0000020 0000 0000 5555 5555 5555 5555 5555 5555 0000040 5555 5555 5555 5555 0000050 2.6.18 (no mm patches) $ ./mm-test | od -x 0000000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa 0000020 aaaa aaaa 5555 5555 5555 5555 5555 5555 0000040 5555 5555 5555 5555 0000050 I don't know if this helps at all. Gordon [1] http://lkml.org/lkml/2006/12/19/200 [2] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89 [3] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=edc79b2a46ed854595e40edcf3f8b37f9f14aa3f [4] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=c1e6098b23bb46e2b488fe9a26f831f867157483 [5] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=e88dd6c11c5aef74d8b74a062767add53315533b [6] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ee6a6457886a80415db209e87033b63f2b06558c [7] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=204ec841fbea3e5138168edbc3a76d46747cc987 diff -Naupr linux-2.6.19.orig/fs/buffer.c linux-2.6.19/fs/buffer.c --- linux-2.6.19.orig/fs/buffer.c 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/fs/buffer.c 2006-12-21 01:16:31.000000000 -0700 @@ -2832,7 +2832,7 @@ int try_to_free_buffers(struct page *pag int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still 
happen? */ @@ -2843,17 +2843,6 @@ int try_to_free_buffers(struct page *pag spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* - * If the filesystem writes its buffers by hand (eg ext3) - * then we can have clean buffers against a dirty page. We - * clean the page here; otherwise later reattachment of buffers - * could encounter a non-uptodate page, which is unresolvable. - * This only applies in the rare case where try_to_free_buffers - * succeeds but the page is not freed. - */ - clear_page_dirty(page); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff -Naupr linux-2.6.19.orig/fs/hugetlbfs/inode.c linux-2.6.19/fs/hugetlbfs/inode.c --- linux-2.6.19.orig/fs/hugetlbfs/inode.c 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/fs/hugetlbfs/inode.c 2006-12-21 01:15:21.000000000 -0700 @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + cancel_dirty_page(page, /* No IO accounting for huge pages? 
*/0); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff -Naupr linux-2.6.19.orig/include/linux/page-flags.h linux-2.6.19/include/linux/page-flags.h --- linux-2.6.19.orig/include/linux/page-flags.h 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/include/linux/page-flags.h 2006-12-21 01:15:21.000000000 -0700 @@ -253,15 +253,11 @@ static inline void SetPageUptodate(struc struct page; /* forward declaration */ -int test_clear_page_dirty(struct page *page); +extern void cancel_dirty_page(struct page *page, unsigned int account_size); + int test_clear_page_writeback(struct page *page); int test_set_page_writeback(struct page *page); -static inline void clear_page_dirty(struct page *page) -{ - test_clear_page_dirty(page); -} - static inline void set_page_writeback(struct page *page) { test_set_page_writeback(page); diff -Naupr linux-2.6.19.orig/mm/memory.c linux-2.6.19/mm/memory.c --- linux-2.6.19.orig/mm/memory.c 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/mm/memory.c 2006-12-21 01:15:21.000000000 -0700 @@ -1832,6 +1832,33 @@ void unmap_mapping_range(struct address_ } EXPORT_SYMBOL(unmap_mapping_range); +static void check_last_page(struct address_space *mapping, loff_t size) +{ + pgoff_t index; + unsigned int offset; + struct page *page; + + if (!mapping) + return; + offset = size & ~PAGE_MASK; + if (!offset) + return; + index = size >> PAGE_SHIFT; + page = find_lock_page(mapping, index); + if (page) { + unsigned int check = 0; + unsigned char *kaddr = kmap_atomic(page, KM_USER0); + do { + check += kaddr[offset++]; + } while (offset < PAGE_SIZE); + kunmap_atomic(kaddr,KM_USER0); + unlock_page(page); + page_cache_release(page); + if (check) + printk("%s: BADNESS: truncate check %u\n", current->comm, check); + } +} + /** * vmtruncate - unmap mappings "freed" by truncate() syscall * @inode: inode of the file used @@ -1865,6 +1892,7 @@ do_expand: goto out_sig; if (offset > inode->i_sb->s_maxbytes) goto out_big; + 
check_last_page(mapping, inode->i_size); i_size_write(inode, offset); out_truncate: diff -Naupr linux-2.6.19.orig/mm/page-writeback.c linux-2.6.19/mm/page-writeback.c --- linux-2.6.19.orig/mm/page-writeback.c 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/mm/page-writeback.c 2006-12-21 01:26:53.000000000 -0700 @@ -843,39 +843,6 @@ int set_page_dirty_lock(struct page *pag EXPORT_SYMBOL(set_page_dirty_lock); /* - * Clear a page's dirty flag, while caring for dirty memory accounting. - * Returns true if the page was previously dirty. - */ -int test_clear_page_dirty(struct page *page) -{ - struct address_space *mapping = page_mapping(page); - unsigned long flags; - - if (mapping) { - write_lock_irqsave(&mapping->tree_lock, flags); - if (TestClearPageDirty(page)) { - radix_tree_tag_clear(&mapping->page_tree, - page_index(page), - PAGECACHE_TAG_DIRTY); - write_unlock_irqrestore(&mapping->tree_lock, flags); - /* - * We can continue to use `mapping' here because the - * page is locked, which pins the address_space - */ - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); - dec_zone_page_state(page, NR_FILE_DIRTY); - } - return 1; - } - write_unlock_irqrestore(&mapping->tree_lock, flags); - return 0; - } - return TestClearPageDirty(page); -} -EXPORT_SYMBOL(test_clear_page_dirty); - -/* * Clear a page's dirty flag, while caring for dirty memory accounting. * Returns true if the page was previously dirty. 
* diff -Naupr linux-2.6.19.orig/mm/rmap.c linux-2.6.19/mm/rmap.c --- linux-2.6.19.orig/mm/rmap.c 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/mm/rmap.c 2006-12-22 23:25:09.000000000 -0700 @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page { struct mm_struct *mm = vma->vm_mm; unsigned long address; - pte_t *pte, entry; + pte_t *pte; spinlock_t *ptl; int ret = 0; @@ -444,17 +444,18 @@ static int page_mkclean_one(struct page if (!pte) goto out; - if (!pte_dirty(*pte) && !pte_write(*pte)) - goto unlock; + if (pte_dirty(*pte) || pte_write(*pte)) { + pte_t entry; - entry = ptep_get_and_clear(mm, address, pte); - entry = pte_mkclean(entry); - entry = pte_wrprotect(entry); - ptep_establish(vma, address, pte, entry); - lazy_mmu_prot_update(entry); - ret = 1; + flush_cache_page(vma, address, pte_pfn(*pte)); + entry = ptep_clear_flush(vma, address, pte); + entry = pte_wrprotect(entry); + entry = pte_mkclean(entry); + set_pte_at(vma, address, pte, entry); + lazy_mmu_prot_update(entry); + ret = 1; + } -unlock: pte_unmap_unlock(pte, ptl); out: return ret; @@ -489,6 +490,8 @@ int page_mkclean(struct page *page) if (mapping) ret = page_mkclean_file(mapping, page); } + if (page_test_and_clear_dirty(page)) + ret = 1; return ret; } @@ -587,8 +590,6 @@ void page_remove_rmap(struct page *page) * Leaving it set also helps swapoff to reinstate ptes * faster for those pages still in swapcache. */ - if (page_test_and_clear_dirty(page)) - set_page_dirty(page); __dec_zone_page_state(page, PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED); } @@ -607,6 +608,7 @@ static int try_to_unmap_one(struct page pte_t pteval; spinlock_t *ptl; int ret = SWAP_AGAIN; + struct page *dirty_page = NULL; address = vma_address(page, vma); if (address == -EFAULT) @@ -633,7 +635,7 @@ static int try_to_unmap_one(struct page /* Move the dirty bit to the physical page now the pte is gone. 
*/ if (pte_dirty(pteval)) - set_page_dirty(page); + dirty_page = page; /* Update high watermark before we lower rss */ update_hiwater_rss(mm); @@ -684,6 +686,8 @@ static int try_to_unmap_one(struct page out_unmap: pte_unmap_unlock(pte, ptl); + if (dirty_page) + set_page_dirty(dirty_page); out: return ret; } @@ -915,6 +919,9 @@ int try_to_unmap(struct page *page, int else ret = try_to_unmap_file(page, migration); + if (page_test_and_clear_dirty(page)) + set_page_dirty(page); + if (!page_mapped(page)) ret = SWAP_SUCCESS; return ret; diff -Naupr linux-2.6.19.orig/mm/truncate.c linux-2.6.19/mm/truncate.c --- linux-2.6.19.orig/mm/truncate.c 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/mm/truncate.c 2006-12-23 13:21:42.000000000 -0700 @@ -50,6 +50,21 @@ static inline void truncate_partial_page do_invalidatepage(page, partial); } +void cancel_dirty_page(struct page *page, unsigned int account_size) +{ + /* If we're cancelling the page, it had better not be mapped any more */+ if (page_mapped(page)) { + static unsigned int warncount; + + WARN_ON(++warncount < 5); + } + + if (TestClearPageDirty(page) && account_size && + mapping_cap_account_dirty(page->mapping)) + dec_zone_page_state(page, NR_FILE_DIRTY); +} + + /* * If truncate cannot remove the fs-private metadata from the page, the page * becomes anonymous. 
It will be left on the LRU and may even be mapped into @@ -66,10 +81,11 @@ truncate_complete_page(struct address_sp if (page->mapping != mapping) return; + cancel_dirty_page(page, PAGE_CACHE_SIZE); + if (PagePrivate(page)) do_invalidatepage(page, 0); - clear_page_dirty(page); ClearPageUptodate(page); ClearPageMappedToDisk(page); remove_from_page_cache(page); @@ -348,7 +364,6 @@ int invalidate_inode_pages2_range(struct for (i = 0; !ret && i < pagevec_count(&pvec); i++) { struct page *page = pvec.pages[i]; pgoff_t page_index; - int was_dirty; lock_page(page); if (page->mapping != mapping) { @@ -384,12 +399,8 @@ int invalidate_inode_pages2_range(struct PAGE_CACHE_SIZE, 0); } } - was_dirty = test_clear_page_dirty(page); - if (!invalidate_complete_page2(mapping, page)) { - if (was_dirty) - set_page_dirty(page); + if (!invalidate_complete_page2(mapping, page)) ret = -EIO; - } unlock_page(page); } pagevec_release(&pvec); -- Gordon Farquharson ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 8:10 ` Gordon Farquharson @ 2006-12-24 8:43 ` Linus Torvalds 2006-12-24 8:57 ` Andrew Morton 2006-12-26 16:17 ` Tobias Diedrich 0 siblings, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-24 8:43 UTC (permalink / raw) To: Gordon Farquharson Cc: Martin Michlmayr, Peter Zijlstra, Andrei Popa, Andrew Morton, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Sun, 24 Dec 2006, Gordon Farquharson wrote: > > Is there any way to provide any debugging information that may help > solve the problem ? I think we have people working on this. I know I'm trying to even come up with an idea of what is going on. I don't think we know yet. > Would it help to know the nature of the corruption e.g. an analysis > of the corruption in the file ? I actually think we know that, because Andrei already gave details. The corruption seems to be basically a few pages that get zeroes at the end rather than the expected contents. That's consistent with the page being written out once, but then _not_ getting written out again despite being dirtied some more. But if you see any other pattern, please holler, because that would be interesting. > BTW, I decided to try Linus's test program [1] on ARM (I don't think > that anybody had tried it on ARM before). You get the expected results, and in fact, I'd be very surprised if you didn't. It's something subtler than that going on. I now _suspect_ that we're talking about something like - we started a writeout. The IO is still pending, and the page was marked clean and is now in the "writeback" phase. - a write happens to the page, and the page gets marked dirty again. Marking the page dirty also marks all the _buffers_ in the page dirty, but they were actually already dirty, because the IO hasn't completed yet. - the IO from the _previous_ write completes, and marks the buffers clean again.
And no, that's not actually what is going on. The thing is, we actually clear the buffer dirty bits when we start the IO, not when we end it, but I think it is going to be this _kind_ of situation, where we missed something, and marked it clean too late, and thus cleared a dirty bit. I don't think it's a page table issue any more, it just doesn't look likely with the ARM UP corruption. It's also not apparently even on a cacheline boundary, so it probably is really a dirty bit that got cleared wrongly due to some race with IO. But right now we're all clueless. I personally suspect it's not even a new bug: it's probably an old bug that simply didn't matter before. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
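The "marked it clean too late" ordering described in the message above can be modelled in plain userspace C. This is only a toy sketch, not kernel code: struct toy_page and every toy_* function are hypothetical names invented for illustration, and the single flag stands in for the page/buffer dirty bits. It compares clearing the dirty flag when the IO is started with clearing it when the IO completes.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model of one page; all names here are hypothetical, not kernel API. */
struct toy_page {
	char mem[8];       /* current in-memory contents */
	char inflight[8];  /* snapshot taken when the IO was submitted */
	char disk[8];      /* what has reached "disk" */
	bool dirty;
	bool writeback;
};

/* A write to the page: update the contents and mark the page dirty. */
static void toy_write(struct toy_page *p, const char *data)
{
	strncpy(p->mem, data, sizeof(p->mem) - 1);
	p->mem[sizeof(p->mem) - 1] = '\0';
	p->dirty = true;
}

/* Submit the IO: snapshot the data, optionally clear dirty right away. */
static void toy_start_io(struct toy_page *p, bool clear_at_start)
{
	memcpy(p->inflight, p->mem, sizeof(p->mem));
	p->writeback = true;
	if (clear_at_start)
		p->dirty = false;
}

/* Complete the IO: the snapshot (not the current contents) hits disk. */
static void toy_end_io(struct toy_page *p, bool clear_at_start)
{
	memcpy(p->disk, p->inflight, sizeof(p->disk));
	p->writeback = false;
	if (!clear_at_start)
		p->dirty = false;  /* late clear: wipes out a newer dirtying */
}

/* Later flush pass: only pages still marked dirty get written. */
static void toy_sync(struct toy_page *p, bool clear_at_start)
{
	if (p->dirty) {
		toy_start_io(p, clear_at_start);
		toy_end_io(p, clear_at_start);
	}
}

/* Returns true if the final on-disk contents match memory. */
static bool toy_run(bool clear_at_start)
{
	struct toy_page p = {0};

	toy_write(&p, "old");
	toy_start_io(&p, clear_at_start);
	toy_write(&p, "new");            /* re-dirtied while IO is pending */
	toy_end_io(&p, clear_at_start);
	toy_sync(&p, clear_at_start);
	return memcmp(p.disk, p.mem, sizeof(p.mem)) == 0;
}
```

In this model toy_run(true) ends with "new" on disk, while toy_run(false) leaves "old" there because the second write's dirty flag was cleared by the completion of the first IO. The thread itself says the real kernel clears the buffer dirty bits at IO start, so this shows only the _kind_ of lost update being suspected, not the actual bug.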
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 8:43 ` Linus Torvalds @ 2006-12-24 8:57 ` Andrew Morton 2006-12-24 9:26 ` Linus Torvalds ` (2 more replies) 2006-12-26 16:17 ` Tobias Diedrich 1 sibling, 3 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-24 8:57 UTC (permalink / raw) To: Linus Torvalds Cc: Gordon Farquharson, Martin Michlmayr, Peter Zijlstra, Andrei Popa, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Sun, 24 Dec 2006 00:43:54 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > I now _suspect_ that we're talking about something like > > - we started a writeout. The IO is still pending, and the page was > marked clean and is now in the "writeback" phase. > - a write happens to the page, and the page gets marked dirty again. > Marking the page dirty also marks all the _buffers_ in the page dirty, > but they were actually already dirty, because the IO hasn't completed > yet. > - the IO from the _previous_ write completes, and marks the buffers clean > again. Some things for the testers to try, please: - mount the fs with ext2 with the no-buffer-head option. That means either: grub.conf: rootfstype=ext2 rootflags=nobh /etc/fstab: ext2 nobh - mount the fs with ext3 data=writeback, nobh grub.conf: rootfstype=ext3 rootflags=nobh,data=writeback (I hope this works) /etc/fstab: ext3 data=writeback,nobh if that still fails we can rule out buffer_head funnies. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 8:57 ` Andrew Morton @ 2006-12-24 9:26 ` Linus Torvalds 2006-12-24 12:14 ` Andrei Popa 2006-12-24 14:05 ` Martin Michlmayr 2 siblings, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-24 9:26 UTC (permalink / raw) To: Andrew Morton Cc: Gordon Farquharson, Martin Michlmayr, Peter Zijlstra, Andrei Popa, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Sun, 24 Dec 2006, Andrew Morton wrote: > > > I now _suspect_ that we're talking about something like > > > > - we started a writeout. The IO is still pending, and the page was > > marked clean and is now in the "writeback" phase. > > - a write happens to the page, and the page gets marked dirty again. > > Marking the page dirty also marks all the _buffers_ in the page dirty, > > but they were actually already dirty, because the IO hasn't completed > > yet. > > - the IO from the _previous_ write completes, and marks the buffers clean > > again. > > Some things for the testers to try, please: > > - mount the fs with ext2 with the no-buffer-head option. That means either: [ snip snip ] This is definitely worth testing, but the exact scenario I outlined is probably not the thing that happens. It was really meant to be more of an example of the _kind_ of situation I think we might have. That would explain why we didn't see this before: we simply didn't mark pages clean all that aggressively, and an app like rtorrent would normally have caused its flushes to happen _synchronously_ by using msync() (even if the IO itself was done asynchronously, all the dirty bit stuff would be synchronous wrt any rtorrent behaviour).
And the things that /did/ use to clean pages asynchronously (VM scanning) would always actually look at the "young" bit (aka "accessed") and not even touch the dirty bit if an application had accessed the page recently, so that basically avoided any likely races, because we'd touch the dirty bit ONLY if the page was "cold". So this is why I'm saying that it might be an old bug, and it would be just the new pattern of handling dirty bits that triggers it. But avoiding buffer heads and testing that part is worth doing. Just to remove one thing from the equation. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
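For readers who want to see the access pattern under discussion (dirtying a file through a shared writable mapping, then flushing with msync()), here is a minimal userspace sketch. It is not Linus's test program, which is not reproduced in this part of the thread; the function name shared_mmap_roundtrip(), the /tmp path template and the 0xaa/0x55 fill pattern are assumptions chosen for illustration.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CHECK_LEN 8192

/*
 * Dirty a temporary file through a MAP_SHARED mapping, flush it with
 * msync(MS_SYNC), then read the file back with pread() and compare.
 * Returns 0 on match, 1 on mismatch, -1 if the setup failed.
 */
static int shared_mmap_roundtrip(void)
{
	char path[] = "/tmp/mmap-check-XXXXXX";  /* hypothetical location */
	unsigned char buf[CHECK_LEN];
	unsigned char *map;
	int result = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);  /* the file goes away once fd is closed */

	if (ftruncate(fd, CHECK_LEN) == 0) {
		map = mmap(NULL, CHECK_LEN, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
		if (map != MAP_FAILED) {
			/* Writes go straight to the page cache via the PTEs. */
			memset(map, 0xaa, CHECK_LEN / 2);
			memset(map + CHECK_LEN / 2, 0x55, CHECK_LEN / 2);

			if (msync(map, CHECK_LEN, MS_SYNC) == 0 &&
			    pread(fd, buf, CHECK_LEN, 0) == CHECK_LEN)
				result = memcmp(buf, map, CHECK_LEN) != 0;

			munmap(map, CHECK_LEN);
		}
	}
	close(fd);
	return result;
}
```

Note the limitation: pread() is normally served from the same page cache pages the mapping uses, so a match here says nothing about what actually reached the disk. Catching the corruption reported in this thread would need the pages to be written back and re-read from disk (for example under memory pressure), which this sketch does not attempt.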
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 8:57 ` Andrew Morton 2006-12-24 9:26 ` Linus Torvalds @ 2006-12-24 12:14 ` Andrei Popa 2006-12-24 12:26 ` Andrei Popa 2006-12-24 12:31 ` Andrew Morton 2006-12-24 14:05 ` Martin Michlmayr 2 siblings, 2 replies; 311+ messages in thread From: Andrei Popa @ 2006-12-24 12:14 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Gordon Farquharson, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Sun, 2006-12-24 at 00:57 -0800, Andrew Morton wrote: > On Sun, 24 Dec 2006 00:43:54 -0800 (PST) > Linus Torvalds <torvalds@osdl.org> wrote: > > > I now _suspect_ that we're talking about something like > > > > - we started a writeout. The IO is still pending, and the page was > > marked clean and is now in the "writeback" phase. > > - a write happens to the page, and the page gets marked dirty again. > > Marking the page dirty also marks all the _buffers_ in the page dirty, > > but they were actually already dirty, because the IO hasn't completed > > yet. > > - the IO from the _previous_ write completes, and marks the buffers clean > > again. > > Some things for the testers to try, please: > > - mount the fs with ext2 with the no-buffer-head option. That means either: > > grub.conf: rootfstype=ext2 rootflags=nobh > /etc/fstab: ext2 nobh ierdnac ~ # mount /dev/sda7 on / type ext2 (rw,noatime,nobh) I have corruption. > > - mount the fs with ext3 data=writeback, nobh > > grub.conf: rootfstype=ext3 rootflags=nobh,data=writeback (I hope this works) > /etc/fstab: ext2 data=writeback,nobh ierdnac ~ # mount /dev/sda7 on / type ext3 (rw,noatime,nobh) ierdnac ~ # dmesg|grep EXT3 EXT3-fs: mounted filesystem with writeback data mode. EXT3 FS on sda7, internal journal I don't have corruption. I tested twice. > > if that still fails we can rule out buffer_head funnies. > ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 12:14 ` Andrei Popa @ 2006-12-24 12:26 ` Andrei Popa 2006-12-24 12:30 ` Andrew Morton 2006-12-24 12:31 ` Andrew Morton 1 sibling, 1 reply; 311+ messages in thread From: Andrei Popa @ 2006-12-24 12:26 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Gordon Farquharson, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Sun, 2006-12-24 at 14:14 +0200, Andrei Popa wrote: > On Sun, 2006-12-24 at 00:57 -0800, Andrew Morton wrote: > > On Sun, 24 Dec 2006 00:43:54 -0800 (PST) > > Linus Torvalds <torvalds@osdl.org> wrote: > > > > > I now _suspect_ that we're talking about something like > > > > > > - we started a writeout. The IO is still pending, and the page was > > > marked clean and is now in the "writeback" phase. > > > - a write happens to the page, and the page gets marked dirty again. > > > Marking the page dirty also marks all the _buffers_ in the page dirty, > > > but they were actually already dirty, because the IO hasn't completed > > > yet. > > > - the IO from the _previous_ write completes, and marks the buffers clean > > > again. > > > > Some things for the testers to try, please: > > > > - mount the fs with ext2 with the no-buffer-head option. That means either: > > > > grub.conf: rootfstype=ext2 rootflags=nobh > > /etc/fstab: ext2 nobh > > ierdnac ~ # mount > /dev/sda7 on / type ext2 (rw,noatime,nobh) > > I have corruption. > > > > > - mount the fs with ext3 data=writeback, nobh > > > > grub.conf: rootfstype=ext3 rootflags=nobh,data=writeback (I hope this works) > > /etc/fstab: ext2 data=writeback,nobh > > ierdnac ~ # mount > /dev/sda7 on / type ext3 (rw,noatime,nobh) > > ierdnac ~ # dmesg|grep EXT3 > EXT3-fs: mounted filesystem with writeback data mode. > EXT3 FS on sda7, internal journal > > I don't have corruption. I tested twice. 
> I also tested with ext3 ordered, nobh and I have file corruption... > > > > if that still fails we can rule out buffer_head funnies. > > ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 12:26 ` Andrei Popa @ 2006-12-24 12:30 ` Andrew Morton 0 siblings, 0 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-24 12:30 UTC (permalink / raw) To: andrei.popa Cc: Linus Torvalds, Gordon Farquharson, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Sun, 24 Dec 2006 14:26:01 +0200 Andrei Popa <andrei.popa@i-neo.ro> wrote: > I also tested with ext3 ordered, nobh and I have file corruption... ordered+nobh isn't a possible combination. The filesystem probably ignored nobh. nobh mode only makes sense with data=writeback. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 12:14 ` Andrei Popa 2006-12-24 12:26 ` Andrei Popa @ 2006-12-24 12:31 ` Andrew Morton 2006-12-24 16:45 ` Andrei Popa 1 sibling, 1 reply; 311+ messages in thread From: Andrew Morton @ 2006-12-24 12:31 UTC (permalink / raw) To: andrei.popa Cc: Linus Torvalds, Gordon Farquharson, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Sun, 24 Dec 2006 14:14:38 +0200 Andrei Popa <andrei.popa@i-neo.ro> wrote: > > - mount the fs with ext2 with the no-buffer-head option. That means either: > > > > grub.conf: rootfstype=ext2 rootflags=nobh > > /etc/fstab: ext2 nobh > > ierdnac ~ # mount > /dev/sda7 on / type ext2 (rw,noatime,nobh) > > I have corruption. > > > > > - mount the fs with ext3 data=writeback, nobh > > > > grub.conf: rootfstype=ext3 rootflags=nobh,data=writeback (I hope this works) > > /etc/fstab: ext2 data=writeback,nobh > > ierdnac ~ # mount > /dev/sda7 on / type ext3 (rw,noatime,nobh) > > ierdnac ~ # dmesg|grep EXT3 > EXT3-fs: mounted filesystem with writeback data mode. > EXT3 FS on sda7, internal journal > > I don't have corruption. I tested twice. This is a surprising result. Can you please retest ext3 data=writeback,nobh? ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 12:31 ` Andrew Morton @ 2006-12-24 16:45 ` Andrei Popa 2006-12-24 17:16 ` Linus Torvalds 0 siblings, 1 reply; 311+ messages in thread From: Andrei Popa @ 2006-12-24 16:45 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Gordon Farquharson, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote: > On Sun, 24 Dec 2006 14:14:38 +0200 > Andrei Popa <andrei.popa@i-neo.ro> wrote: > > > > - mount the fs with ext2 with the no-buffer-head option. That means either: > > > > > > grub.conf: rootfstype=ext2 rootflags=nobh > > > /etc/fstab: ext2 nobh > > > > ierdnac ~ # mount > > /dev/sda7 on / type ext2 (rw,noatime,nobh) > > > > I have corruption. > > > > > > > > - mount the fs with ext3 data=writeback, nobh > > > > > > grub.conf: rootfstype=ext3 rootflags=nobh,data=writeback (I hope this works) > > > /etc/fstab: ext2 data=writeback,nobh > > > > ierdnac ~ # mount > > /dev/sda7 on / type ext3 (rw,noatime,nobh) > > > > ierdnac ~ # dmesg|grep EXT3 > > EXT3-fs: mounted filesystem with writeback data mode. > > EXT3 FS on sda7, internal journal > > > > I don't have corruption. I tested twice. > > This is a surprising result. Can you pleas retest ext3 data=writeback,nobh? Yes, no corruption. Also tested only with data=writeback and had no corruption. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 16:45 ` Andrei Popa @ 2006-12-24 17:16 ` Linus Torvalds 2006-12-24 18:07 ` Andrew Morton ` (2 more replies) 0 siblings, 3 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-24 17:16 UTC (permalink / raw) To: Andrei Popa Cc: Andrew Morton, Gordon Farquharson, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Sun, 24 Dec 2006, Andrei Popa wrote: > On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote: > > Andrei Popa <andrei.popa@i-neo.ro> wrote: > > > /dev/sda7 on / type ext3 (rw,noatime,nobh) > > > > > > I don't have corruption. I tested twice. > > > > This is a surprising result. Can you pleas retest ext3 data=writeback,nobh? > > Yes, no corruption. Also tested only with data=writeback and had no > corruption. Ok, so it would seem to be writeback related _somehow_. However, most of the differences (I _thought_) in ext3 actually show up only if you have *both* "nobh" and "data=writeback", and as far as I can tell, just a simple "data=writeback" should still use the bog-standard "block_write_full_page()". Andrew? Although as far as I can see, then ext2 should work as-is too (since it too also just uses "block_write_full_page()" without anything fancy). Strange. How about this particularly stupid diff? (please test with something that _would_ cause corruption normally). It is _entirely_ untested, but what it tries to do is to simply serialize any writeback in progress with any process that tries to re-map a shared page into its address space and dirty it. I haven't tested it, and maybe it misses some case, but it looks like a good way to try to avoid races with marking pages dirty and the writeback phase ..
Linus --- diff --git a/mm/memory.c b/mm/memory.c index 563792f..64ed10b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1544,6 +1544,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, if (!pte_same(*page_table, orig_pte)) goto unlock; } + wait_on_page_writeback(old_page); dirty_page = old_page; get_page(dirty_page); reuse = 1; @@ -2215,6 +2216,7 @@ retry: page_cache_release(new_page); return VM_FAULT_SIGBUS; } + wait_on_page_writeback(new_page); } } ^ permalink raw reply related [flat|nested] 311+ messages in thread
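The idea behind the wait_on_page_writeback() calls in the diff above (make the fault path wait for any writeback in flight before the page is dirtied again) can be illustrated with a single-threaded toy model. Everything here is an assumption made for the sketch: struct wb_page and the wb_* helpers are hypothetical, "waiting" is modelled by simply running the pending completion, and the completion handler is deliberately buggy (it clears the dirty flag late) to stand in for whatever race is actually being hit.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page state; hypothetical names, not the kernel's struct page. */
struct wb_page {
	int mem_version;       /* version of the data currently in memory */
	int inflight_version;  /* version captured when IO was submitted */
	int disk_version;      /* version that has reached "disk" */
	bool dirty;
	bool writeback;
};

/* Submit a writeout of whatever is in memory right now. */
static void wb_submit(struct wb_page *p)
{
	p->inflight_version = p->mem_version;
	p->writeback = true;
}

/* Deliberately buggy completion: clears the dirty flag at IO end. */
static void wb_complete(struct wb_page *p)
{
	p->disk_version = p->inflight_version;
	p->writeback = false;
	p->dirty = false;
}

/* Stand-in for waiting on writeback: run the pending completion. */
static void wb_wait(struct wb_page *p)
{
	if (p->writeback)
		wb_complete(p);
}

/* Fault path: optionally wait for writeback, then dirty the page. */
static void wb_fault_and_dirty(struct wb_page *p, bool wait_first)
{
	if (wait_first)
		wb_wait(p);
	p->mem_version++;
	p->dirty = true;
}

/* Returns true if the newest data eventually reaches disk. */
static bool wb_survives(bool wait_first)
{
	struct wb_page p = { .mem_version = 1, .dirty = true };

	wb_submit(&p);                      /* writeout of version 1 starts */
	wb_fault_and_dirty(&p, wait_first); /* version 2 is written */
	wb_wait(&p);                        /* drain any IO still pending */
	if (p.dirty) {                      /* later flush pass */
		wb_submit(&p);
		wb_complete(&p);
	}
	return p.disk_version == p.mem_version;
}
```

With wait_first set, the completion always runs before the page is re-dirtied, so the late clear cannot swallow the new dirty flag and version 2 is flushed. Without it, the model loses the update. Whether the real corruption follows this pattern is exactly what the patch above is meant to test.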
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 17:16 ` Linus Torvalds @ 2006-12-24 18:07 ` Andrew Morton 2006-12-24 18:37 ` Linus Torvalds 2006-12-24 19:27 ` Gordon Farquharson 2 siblings, 0 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-24 18:07 UTC (permalink / raw) To: Linus Torvalds Cc: Andrei Popa, Gordon Farquharson, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Sun, 24 Dec 2006 09:16:06 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > > > On Sun, 24 Dec 2006, Andrei Popa wrote: > > > On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote: > > > Andrei Popa <andrei.popa@i-neo.ro> wrote: > > > > /dev/sda7 on / type ext3 (rw,noatime,nobh) > > > > > > > > I don't have corruption. I tested twice. > > > > > > This is a surprising result. Can you pleas retest ext3 data=writeback,nobh? > > > > Yes, no corruption. Also tested only with data=writeback and had no > > corruption. > > Ok, so it would seem to be writeback related _somehow_. However, most of > the differences (I _thought_) in ext3 actually show up only if you have > *both* "nobh" and "data=writeback", and as far as I can tell, just a > simple "data=writeback" should still use the bog-standard > "block_write_full_page()". > > Andrew? > > Although as far as I can see, then ext2 should work as-is too (since it > too also just uses "block_write_full_page()" without anything fancy). ext2 uses the multipage-bio assembly code for writeback whereas ext3 doesn't. But ext3 doesn't use that code in data=ordered mode, of course. 
Still, this: --- a/fs/ext2/inode.c~a +++ a/fs/ext2/inode.c @@ -693,7 +693,7 @@ const struct address_space_operations ex .commit_write = generic_commit_write, .bmap = ext2_bmap, .direct_IO = ext2_direct_IO, - .writepages = ext2_writepages, +// .writepages = ext2_writepages, .migratepage = buffer_migrate_page, }; @@ -711,7 +711,7 @@ const struct address_space_operations ex .commit_write = nobh_commit_write, .bmap = ext2_bmap, .direct_IO = ext2_direct_IO, - .writepages = ext2_writepages, +// .writepages = ext2_writepages, .migratepage = buffer_migrate_page, }; _ will switch it off for ext2. > Strange. > > How about this particularly stupid diff? (please test with something that > _would_ cause corruption normally). > > It is _entirely_ untested, but what it tries to do is to simply serialize > any writeback in progress with any process that tries to re-map a shared > page into its address space and dirty it. I haven't tested it, and maybe > it misses some case, but it looks likea good way to try to avoid races > with marking pages dirty and the writeback phase .. > > Linus > --- > diff --git a/mm/memory.c b/mm/memory.c > index 563792f..64ed10b 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1544,6 +1544,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, > if (!pte_same(*page_table, orig_pte)) > goto unlock; > } > + wait_on_page_writeback(old_page); > dirty_page = old_page; > get_page(dirty_page); > reuse = 1; > @@ -2215,6 +2216,7 @@ retry: > page_cache_release(new_page); > return VM_FAULT_SIGBUS; > } > + wait_on_page_writeback(new_page); > } > } yup. Also, we could perhaps lock the target page during pagefaults.. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 17:16 ` Linus Torvalds 2006-12-24 18:07 ` Andrew Morton @ 2006-12-24 18:37 ` Linus Torvalds 2006-12-24 19:18 ` Linus Torvalds 2006-12-24 21:21 ` Michael S. Tsirkin 2006-12-24 19:27 ` Gordon Farquharson 2 siblings, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-24 18:37 UTC (permalink / raw) To: Andrei Popa, Peter Zijlstra Cc: Andrew Morton, Gordon Farquharson, Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Sun, 24 Dec 2006, Linus Torvalds wrote: > > How about this particularly stupid diff? (please test with something that > _would_ cause corruption normally). Actually, here's an even more stupid diff, which actually to some degree seems to capture the real problem better. Peter, tell me I'm crazy, but with the new rules, the following condition is a bug: - shared mapping - writable - not already marked dirty in the PTE because that combination means that the hardware can mark the PTE dirty without us even realizing (and thus not marking the "struct page *" dirty). (The above is actually a valid situation for IO mappings, but not for "real" mappings. And IO mappings should never take page faults, I think). So, with that in mind, I wrote this stupid patch (for 32-bit x86, since I used my Mac Mini for testing rather than my main machine - but the x86-64 version should be pretty much identical).. And you know what, Peter? It triggers for me. I get WARNING at mm/memory.c:2274 do_no_page() [<c0103d4a>] show_trace_log_lvl+0x1a/0x2f [<c010436c>] show_trace+0x12/0x14 [<c01043f0>] dump_stack+0x16/0x18 [<c0159790>] __handle_mm_fault+0x38d/0x919 [<c011c8c4>] do_page_fault+0x1ff/0x507 [<c03fabcc>] error_code+0x7c/0x84 which seems to say that do_no_page() can be used to insert shared and non-dirty, but still writable, pages. But maybe my patch is just bogus, and I didn't think it through. 
Peter, I realize it's Christmas Eve, but let's face it, Santa appreciates good boys and girls, and we all want tons of loot. So please be good, and waste some time looking at this and tell me why I'm either wrong, or there's a real smoking gun here.. ;) Linus --- diff --git a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h index e6a4723..1389bb7 100644 --- a/include/asm-i386/pgtable.h +++ b/include/asm-i386/pgtable.h @@ -494,7 +494,13 @@ do { \ * The i386 doesn't have any external MMU info: the kernel page * tables contain all the necessary information. */ -#define update_mmu_cache(vma,address,pte) do { } while (0) +#define bad_shared_pte(pte) (pte_write(pte) && !pte_dirty(pte)) +#define update_mmu_cache(vma,address,pte) do { \ + static int __cnt; \ + WARN_ON(((vma)->vm_flags & VM_SHARED) \ + && bad_shared_pte(pte) \ + && ++__cnt < 5); \ +} while (0) #endif /* !__ASSEMBLY__ */ #ifdef CONFIG_FLATMEM ^ permalink raw reply related [flat|nested] 311+ messages in thread
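The three conditions in the message above (shared mapping, writable, not yet dirty in the PTE) can be modelled in a few lines of userspace C. This is only a sketch of the predicate, not kernel code: the `MODEL_*` flag bits and the `model_*` function names are hypothetical stand-ins invented for illustration, whereas the real patch uses `pte_write()`, `pte_dirty()` and `VM_SHARED` on architecture-specific layouts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this userspace model; the real kernel
 * encodes these in architecture-specific PTE and vm_flags layouts. */
#define MODEL_PTE_WRITE  0x1u
#define MODEL_PTE_DIRTY  0x2u
#define MODEL_VM_SHARED  0x1u

/* Same idea as bad_shared_pte() in the patch: writable but still clean. */
static bool model_bad_shared_pte(unsigned pte)
{
    return (pte & MODEL_PTE_WRITE) && !(pte & MODEL_PTE_DIRTY);
}

/* True when a mapping meets all three conditions listed in the message. */
static bool model_violates_dirty_rule(unsigned vm_flags, unsigned pte)
{
    return (vm_flags & MODEL_VM_SHARED) && model_bad_shared_pte(pte);
}

/* Userspace stand-in for the WARN_ON: abort if the rule is violated. */
static void model_check_mapping(unsigned vm_flags, unsigned pte)
{
    assert(!model_violates_dirty_rule(vm_flags, pte));
}
```

Only the combination of all three conditions is flagged: a private writable clean PTE, or a shared PTE that is already dirty, passes the check.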
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 18:37 ` Linus Torvalds @ 2006-12-24 19:18 ` Linus Torvalds 2006-12-24 20:55 ` Gordon Farquharson 2006-12-26 10:31 ` Nick Piggin 2006-12-24 21:21 ` Michael S. Tsirkin 1 sibling, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-24 19:18 UTC (permalink / raw) To: Andrei Popa, Peter Zijlstra, David S. Miller Cc: Andrew Morton, Gordon Farquharson, Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Sun, 24 Dec 2006, Linus Torvalds wrote: > > Peter, tell me I'm crazy, but with the new rules, the following condition > is a bug: > > - shared mapping > - writable > - not already marked dirty in the PTE Ok, so how about this diff. I'm actually feeling good about this one. It really looks like "do_no_page()" was simply buggy, and that this explains everything. Please please please test. Throw all the other patches away (with the possible exception of the "update_mmu_cache()" sanity checker, which is still interesting in case some _other_ place does this too). Don't do the "wait_on_page_writeback()" thing, because it changes timings and might hide things for the wrong reasons. Just apply this on top of a known failing kernel, and test. 
Linus --- diff --git a/mm/memory.c b/mm/memory.c index 563792f..cf429c4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2247,21 +2249,23 @@ retry: if (pte_none(*page_table)) { flush_icache_page(vma, new_page); entry = mk_pte(new_page, vma->vm_page_prot); - if (write_access) - entry = maybe_mkwrite(pte_mkdirty(entry), vma); - set_pte_at(mm, address, page_table, entry); if (anon) { inc_mm_counter(mm, anon_rss); lru_cache_add_active(new_page); page_add_new_anon_rmap(new_page, vma, address); + if (write_access) + entry = maybe_mkwrite(pte_mkdirty(entry), vma); } else { inc_mm_counter(mm, file_rss); page_add_file_rmap(new_page); + entry = pte_wrprotect(entry); if (write_access) { dirty_page = new_page; get_page(dirty_page); + entry = maybe_mkwrite(pte_mkdirty(entry), vma); } } + set_pte_at(mm, address, page_table, entry); } else { /* One of our sibling threads was faster, back out. */ page_cache_release(new_page); ^ permalink raw reply related [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 19:18 ` Linus Torvalds @ 2006-12-24 20:55 ` Gordon Farquharson 2006-12-26 10:31 ` Nick Piggin 1 sibling, 0 replies; 311+ messages in thread From: Gordon Farquharson @ 2006-12-24 20:55 UTC (permalink / raw) To: Linus Torvalds Cc: Andrei Popa, Peter Zijlstra, David S. Miller, Andrew Morton, Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On 12/24/06, Linus Torvalds <torvalds@osdl.org> wrote: > Ok, so how about this diff. > > I'm actually feeling good about this one. It really looks like > "do_no_page()" was simply buggy, and that this explains everything. I tested with just this patch and 2.6.19 and no change. Sorry Linus, no early Christmas present :-( Gordon -- Gordon Farquharson ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 19:18 ` Linus Torvalds 2006-12-24 20:55 ` Gordon Farquharson @ 2006-12-26 10:31 ` Nick Piggin 2006-12-26 19:26 ` Linus Torvalds 1 sibling, 1 reply; 311+ messages in thread From: Nick Piggin @ 2006-12-26 10:31 UTC (permalink / raw) To: Linus Torvalds Cc: Andrei Popa, Peter Zijlstra, David S. Miller, Andrew Morton, Gordon Farquharson, Martin Michlmayr, Hugh Dickins, Arjan van de Ven, Linux Kernel Mailing List Linus Torvalds wrote: > > On Sun, 24 Dec 2006, Linus Torvalds wrote: > >>Peter, tell me I'm crazy, but with the new rules, the following condition >>is a bug: >> >> - shared mapping >> - writable >> - not already marked dirty in the PTE > > > Ok, so how about this diff. > > I'm actually feeling good about this one. It really looks like > "do_no_page()" was simply buggy, and that this explains everything. Still trying to catch up here, so I'm not going to reply to any old stuff and just start at the tip of the thread... Other than to say that I really like cancel_page_dirty ;) I think your patch is quite right so that's a good catch. But I'm not too surprised that it does not help the problem, because I don't think we have started shedding any old pte_dirty tests at unmap/reclaim-time, have we? So the dirty bit isn't going to get lost, as such. I was hoping that you've almost narrowed it down to the filesystem writeback code, with the last few mails? Nick > Please please please test. Throw all the other patches away (with the > possible exception of the "update_mmu_cache()" sanity checker, which is > still interesting in case some _other_ place does this too). > > Don't do the "wait_on_page_writeback()" thing, because it changes timings > and might hide things for the wrong reasons. Just apply this on top of a > known failing kernel, and test. 
> > Linus > > --- > diff --git a/mm/memory.c b/mm/memory.c > index 563792f..cf429c4 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2247,21 +2249,23 @@ retry: > if (pte_none(*page_table)) { > flush_icache_page(vma, new_page); > entry = mk_pte(new_page, vma->vm_page_prot); > - if (write_access) > - entry = maybe_mkwrite(pte_mkdirty(entry), vma); > - set_pte_at(mm, address, page_table, entry); > if (anon) { > inc_mm_counter(mm, anon_rss); > lru_cache_add_active(new_page); > page_add_new_anon_rmap(new_page, vma, address); > + if (write_access) > + entry = maybe_mkwrite(pte_mkdirty(entry), vma); > } else { > inc_mm_counter(mm, file_rss); > page_add_file_rmap(new_page); > + entry = pte_wrprotect(entry); > if (write_access) { > dirty_page = new_page; > get_page(dirty_page); > + entry = maybe_mkwrite(pte_mkdirty(entry), vma); > } > } > + set_pte_at(mm, address, page_table, entry); > } else { > /* One of our sibling threads was faster, back out. */ > page_cache_release(new_page); > -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-26 10:31 ` Nick Piggin @ 2006-12-26 19:26 ` Linus Torvalds 2006-12-27 12:32 ` Jari Sundell ` (2 more replies) 0 siblings, 3 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-26 19:26 UTC (permalink / raw) To: Nick Piggin Cc: Andrei Popa, Peter Zijlstra, David S. Miller, Andrew Morton, Gordon Farquharson, Martin Michlmayr, Hugh Dickins, Arjan van de Ven, Linux Kernel Mailing List On Tue, 26 Dec 2006, Nick Piggin wrote: > Linus Torvalds wrote: > > > > Ok, so how about this diff. > > > > I'm actually feeling good about this one. It really looks like > > "do_no_page()" was simply buggy, and that this explains everything. > > Still trying to catch up here, so I'm not going to reply to any old > stuff and just start at the tip of the thread... Other than to say > that I really like cancel_page_dirty ;) Yeah, I think that part is a bit clearer about what's going on now. > I think your patch is quite right so that's a good catch. Actually, since people told me it didn't matter, I went back and looked at _why_ - the thing is, "vma->vm_page_prot" should always be read-only anyway, except for mappings that don't do dirty accounting at all, so I think my patch only found cases that are unimportant (ie pages that get faulted on on filesystems like ramfs that doesn't do any dirty page accounting because they're all dirty anyway). > But I'm not too surprised that it does not help the problem, because I > don't think we have started shedding any old pte_dirty tests at > unmap/reclaim-time, have we? So the dirty bit isn't going to get lost, > as such. True. We should no longer _need_ those dirty bit reclaims at unmap/reclaim, but we still do them, so you're right, even if we were buggy in this area, it should only really matter for the dirty page counting, not for any lost data. 
> I was hoping that you've almost narrowed it down to the filesystem > writeback code, with the last few mails? I think so, yes. However, I've checked, and "rtorrent" really does seem to be fairly well-behaved wrt any filesystem activity. It does - no threading. It's 100% single-threaded, and doesn't even appear to use signals. - exactly _one_ "ftruncate()", and it does it at the beginning, for the full final size. IOW, it's not anything subtle with truncate and dirty page cancel. - It never uses mprotect on the shared mappings, but it _does_ do: "mincore()" - but the return values don't much matter (it's used as a heuristic on which parts to hash, apparently) I double- and triple-checked this one, because I did make changes to "mincore()", but those didn't go into the affected kernels anyway (ie they are not in plain 2.6.19, nor in 2.6.18.3 either) "madvise(MADV_WILLNEED)" "msync(MS_ASYNC)" (or MS_SYNC if you use a command line flag) "munmap()" of course - it never seems to mix mmap() and write() - it does _only_ mmap. - it seems to mmap/munmap the shared files in nice 64-page chunks, all 64-page aligned in the file (ie it does NOT create one big mapping, it has some kind of LRU of these 64-page chunks). The only exception being the last chunk, which it maps byte-accurate to the size. - I haven't checked whether it only ever has the same chunk mapped once at a time. Anyway, the _one_ half-way interesting thing is the fact that it doesn't allocate any backing store at all for the file, and as such the page writeback needs to create all the underlying buffers on the filesystem. 
I really don't see why that would be a problem either, but I could imagine that if we have some writeback bug where we can end up writing back the _same_ page concurrently, we'd actually end up racing in the kernel, and allocating two different backing stores, and then maybe the other one would effectively "get lost" (and the earlier writeback would win the race, explaining why we'd end up with zeroes at the end of a block). Or something. However, all the codepaths _seem_ to test for PG_writeback, and not even try to start another writeback while the first one is still active. What would also actually be interesting is whether somebody can reproduce this on Reiserfs, for example. I _think_ all the reports I've seen are on ext2 or ext3, and if this is somehow writeback-related, it could be some bug that is just shared between the two by virtue of them still having a lot of stuff in common. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
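For readers who want to experiment with the access pattern described in the message above (a single ftruncate() to the final size, then chunk-aligned shared mappings that are dirtied, synced and unmapped), here is a minimal userspace sketch in C. It is not taken from rtorrent and is not a reproducer for the bug: the function name `mmap_chunk_roundtrip`, the fill byte, and the use of MS_SYNC (chosen so the read-back check is deterministic) are assumptions made for illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Chunk size in pages; 64 matches the chunking described in the message. */
#define CHUNK_PAGES 64

/*
 * Create `path`, size it with one ftruncate(), dirty one chunk through a
 * shared mapping, flush it, and read it back with pread().
 * Returns 0 if the data read back matches what was written, -1 otherwise.
 */
static int mmap_chunk_roundtrip(const char *path, unsigned chunk_index,
                                unsigned total_chunks)
{
    unsigned char buf[4096];
    unsigned char *map;
    size_t chunk_len, done;
    off_t chunk_off;
    long page;
    int fd, sync_rc, unmap_rc, rc = -1;

    page = sysconf(_SC_PAGESIZE);
    assert(page > 0 && chunk_index < total_chunks);
    chunk_len = (size_t)page * CHUNK_PAGES;
    chunk_off = (off_t)chunk_len * (off_t)chunk_index;

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* One ftruncate() up front: the file is sparse, no blocks allocated. */
    if (ftruncate(fd, (off_t)chunk_len * (off_t)total_chunks) != 0)
        goto out;

    /* Map a single chunk-aligned window as a shared writable mapping. */
    map = mmap(NULL, chunk_len, PROT_READ | PROT_WRITE, MAP_SHARED,
               fd, chunk_off);
    if (map == MAP_FAILED)
        goto out;

    memset(map, 0xA5, chunk_len);   /* dirty every page in the chunk */

    /* MS_SYNC keeps the check below deterministic; the message says
     * rtorrent uses MS_ASYNC unless told otherwise. */
    sync_rc = msync(map, chunk_len, MS_SYNC);
    unmap_rc = munmap(map, chunk_len);
    if (sync_rc != 0 || unmap_rc != 0)
        goto out;

    /* Read the chunk back through the descriptor and compare each byte. */
    rc = 0;
    for (done = 0; done < chunk_len && rc == 0; ) {
        size_t want = chunk_len - done;
        ssize_t got, i;

        if (want > sizeof buf)
            want = sizeof buf;
        got = pread(fd, buf, want, chunk_off + (off_t)done);
        if (got <= 0) {
            rc = -1;
            break;
        }
        for (i = 0; i < got; i++) {
            if (buf[i] != 0xA5) {
                rc = -1;
                break;
            }
        }
        done += (size_t)got;
    }
out:
    close(fd);
    unlink(path);
    return rc;
}
```

Because the file is created sparse, the writeback triggered by msync() is what first allocates blocks for the chunk, which is the "no backing store" situation the message singles out. A single run passing says nothing about the race under discussion; the sketch only shows the sequence of calls.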
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-26 19:26 ` Linus Torvalds @ 2006-12-27 12:32 ` Jari Sundell 2006-12-27 12:44 ` valdyn 2007-01-07 2:06 ` Tom Lanyon 2 siblings, 0 replies; 311+ messages in thread From: Jari Sundell @ 2006-12-27 12:32 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Andrei Popa, Peter Zijlstra, David S. Miller, Andrew Morton, Gordon Farquharson, Martin Michlmayr, Hugh Dickins, Arjan van de Ven, Linux Kernel Mailing List [-- Attachment #1: Type: text/plain, Size: 1553 bytes --] On 12/27/06, Linus Torvalds <torvalds@osdl.org> wrote: <snip> > - It never uses mprotect on the shared mappings, but it _does_ do: > "mincore()" - but the return values don't much matter (it's used > as a heuristic on which parts to hash, apparently) > > I double- and triple-checked this one, because I > did make changes to "mincore()", but those didn't go > into the affected kernels anyway (ie they are not in > plain 2.6.19, nor in 2.6.18.3 either) Correct, mincore is only used to check if it should delay the hash checking. > "madvise(MADV_WILLNEED)" > "msync(MS_ASYNC)" (or MS_SYNC if you use a command line flag) > "munmap()" of course > > - it never seems to mix mmap() and write() - it does _only_ mmap. > > - it seems to mmap/munmap the shared files in nice 64-page chunks, all > 64-page aligned in the file (ie it does NOT create one big mapping, it > has some kind of LRU of these 64-page chunks). The only exception being > the last chunk, which it maps byte-accurate to the size. The length of the chunks is only page aligned on single file torrents, not so on multi-file torrents. I've attached a patch for rtorrent that will extend the length to the page boundary. > - I haven't checked whether it only ever has the same chunk mapped once > at a time. This should be the case, but two mapped chunks may share a page, sometimes with different r/w permissions. 
Jari Sundell [-- Attachment #2: extend_mapping.diff --] [-- Type: application/octet-stream, Size: 1887 bytes --] Index: libtorrent/src/data/socket_file.cc =================================================================== --- libtorrent/src/data/socket_file.cc (revision 827) +++ libtorrent/src/data/socket_file.cc (working copy) @@ -162,20 +162,27 @@ MemoryChunk SocketFile::create_chunk(uint64_t offset, uint32_t length, int prot, int flags) const { if (!is_open()) - throw internal_error("SocketFile::get_chunk() called on a closed file"); + throw internal_error("SocketFile::get_chunk() called on a closed file."); if (((prot & MemoryChunk::prot_read) && !is_readable()) || ((prot & MemoryChunk::prot_write) && !is_writable())) - throw storage_error("SocketFile::get_chunk() permission denied"); + throw storage_error("SocketFile::get_chunk() permission denied."); + uint64_t fileSize = size(); + // For some reason mapping beyond the extent of the file does not // cause mmap to complain, so we need to check manually here. - if (offset < 0 || length == 0 || offset > size() || offset + length > size()) + if (offset < 0 || length == 0 || offset > fileSize || offset + length > fileSize) return MemoryChunk(); - uint64_t align = offset % MemoryChunk::page_size(); + uint64_t align = offset % MemoryChunk::page_size(); + uint64_t mapLength = std::min(((length + align + MemoryChunk::page_size() - 1) / MemoryChunk::page_size()) * MemoryChunk::page_size(), + fileSize - (offset - align)); - char* ptr = (char*)mmap(NULL, length + align, prot, flags, m_fd, offset - align); + if (offset - align + mapLength != fileSize && (offset - align + mapLength) % MemoryChunk::page_size() != 0) + throw internal_error("SocketFile::create_chunk(...) Length not page aligned."); + + char* ptr = (char*)mmap(NULL, mapLength, prot, flags, m_fd, offset - align); if (ptr == MAP_FAILED) return MemoryChunk(); ^ permalink raw reply [flat|nested] 311+ messages in thread
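The length calculation in the attached patch (round the mapping up to a page boundary, but never past the end of the file) is easier to follow with concrete numbers. The following C sketch restates that arithmetic outside libtorrent; the function name `page_extended_length` and the convention of returning 0 for a rejected range are assumptions for illustration, not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * For a byte range [offset, offset + length) in a file of file_size bytes,
 * return the length of the mapping that starts at the page boundary at or
 * below offset and ends at the next page boundary, clamped to end of file.
 * Returns 0 if the range is empty or does not fit inside the file.
 * (Sketch only: very large values could overflow the rounding step.)
 */
static uint64_t page_extended_length(uint64_t offset, uint64_t length,
                                     uint64_t file_size, uint64_t page_size)
{
    uint64_t align, start, rounded, to_eof;

    assert(page_size > 0);

    if (length == 0 || offset > file_size || length > file_size - offset)
        return 0;

    align = offset % page_size;     /* bytes between page start and offset */
    start = offset - align;         /* page-aligned start of the mapping */
    rounded = ((length + align + page_size - 1) / page_size) * page_size;
    to_eof = file_size - start;     /* bytes from mapping start to EOF */

    return rounded < to_eof ? rounded : to_eof;
}
```

With 4096-byte pages and a 20000-byte file, a request for 3000 bytes at offset 5000 starts its mapping at 4096 and is extended to a full 4096-byte page, while the same request at offset 16484 starts at 16384 and is clamped to the 3616 bytes that remain before end of file.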
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-26 19:26 ` Linus Torvalds 2006-12-27 12:32 ` Jari Sundell @ 2006-12-27 12:44 ` valdyn 2006-12-27 13:33 ` Jari Sundell 2007-01-07 2:06 ` Tom Lanyon 2 siblings, 1 reply; 311+ messages in thread From: valdyn @ 2006-12-27 12:44 UTC (permalink / raw) To: linux-kernel Cc: Nick Piggin, Andrei Popa, Peter Zijlstra, David S. Miller, Andrew Morton, Gordon Farquharson, Martin Michlmayr, Hugh Dickins, Arjan van de Ven, Linux Kernel Mailing List, Linus Torvalds On Tue, Dec 26, 2006 at 11:26:50AM -0800, Linus Torvalds wrote: > What would also actually be interesting is whether somebody can reproduce > this on Reiserfs, for example. I _think_ all the reports I've seen are on > ext2 or ext3, and if this is somehow writeback-related, it could be some > bug that is just shared between the two by virtue of them still having a > lot of stuff in common. > > Linus I do get this error on reiserfs ( old one, didn't try on reiser4 ). Stock 2.6.19 plus reiser4 patch. Previously reported by me only in the debian bts. 
flo attenberger --- Linux master 2.6.19 #1 PREEMPT Thu Dec 21 10:55:34 CET 2006 x86_64 GNU/Linux # # Automatically generated make config: don't edit # Linux kernel version: 2.6.19 # Thu Dec 21 10:45:05 2006 # CONFIG_X86_64=y CONFIG_64BIT=y CONFIG_X86=y CONFIG_ZONE_DMA32=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_MMU=y CONFIG_RWSEM_GENERIC_SPINLOCK=y CONFIG_GENERIC_HWEIGHT=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_X86_CMPXCHG=y CONFIG_EARLY_PRINTK=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_DMI=y CONFIG_AUDIT_ARCH=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # Code maturity level options # CONFIG_EXPERIMENTAL=y CONFIG_BROKEN_ON_SMP=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 # # General setup # CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y CONFIG_SWAP=y CONFIG_SYSVIPC=y # CONFIG_IPC_NS is not set CONFIG_POSIX_MQUEUE=y CONFIG_BSD_PROCESS_ACCT=y # CONFIG_BSD_PROCESS_ACCT_V3 is not set # CONFIG_TASKSTATS is not set # CONFIG_UTS_NS is not set # CONFIG_AUDIT is not set CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y # CONFIG_RELAY is not set CONFIG_INITRAMFS_SOURCE="" # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set CONFIG_SYSCTL=y # CONFIG_EMBEDDED is not set CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y CONFIG_KALLSYMS_ALL=y # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_EPOLL=y CONFIG_SHMEM=y CONFIG_SLAB=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 # CONFIG_SLOB is not set # # Loadable module support # CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y CONFIG_MODVERSIONS=y # CONFIG_MODULE_SRCVERSION_ALL is not set CONFIG_KMOD=y # # Block layer # CONFIG_BLOCK=y # CONFIG_LBD is not set # CONFIG_BLK_DEV_IO_TRACE is not set # CONFIG_LSF is not set # # IO 
Schedulers # CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_AS=m CONFIG_IOSCHED_DEADLINE=m CONFIG_IOSCHED_CFQ=y # CONFIG_DEFAULT_AS is not set # CONFIG_DEFAULT_DEADLINE is not set CONFIG_DEFAULT_CFQ=y # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED="cfq" # # Processor type and features # CONFIG_X86_PC=y # CONFIG_X86_VSMP is not set CONFIG_MK8=y # CONFIG_MPSC is not set # CONFIG_GENERIC_CPU is not set CONFIG_X86_L1_CACHE_BYTES=64 CONFIG_X86_L1_CACHE_SHIFT=6 CONFIG_X86_INTERNODE_CACHE_BYTES=64 CONFIG_X86_TSC=y CONFIG_X86_GOOD_APIC=y CONFIG_MICROCODE=m CONFIG_MICROCODE_OLD_INTERFACE=y CONFIG_X86_MSR=m CONFIG_X86_CPUID=m CONFIG_X86_IO_APIC=y CONFIG_X86_LOCAL_APIC=y CONFIG_MTRR=y # CONFIG_SMP is not set # CONFIG_PREEMPT_NONE is not set # CONFIG_PREEMPT_VOLUNTARY is not set CONFIG_PREEMPT=y CONFIG_PREEMPT_BKL=y CONFIG_ARCH_SPARSEMEM_ENABLE=y CONFIG_ARCH_FLATMEM_ENABLE=y CONFIG_SELECT_MEMORY_MODEL=y CONFIG_FLATMEM_MANUAL=y # CONFIG_DISCONTIGMEM_MANUAL is not set # CONFIG_SPARSEMEM_MANUAL is not set CONFIG_FLATMEM=y CONFIG_FLAT_NODE_MEM_MAP=y # CONFIG_SPARSEMEM_STATIC is not set CONFIG_SPLIT_PTLOCK_CPUS=4 CONFIG_RESOURCES_64BIT=y CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y CONFIG_HPET_TIMER=y CONFIG_IOMMU=y # CONFIG_CALGARY_IOMMU is not set CONFIG_SWIOTLB=y CONFIG_X86_MCE=y # CONFIG_X86_MCE_INTEL is not set CONFIG_X86_MCE_AMD=y CONFIG_KEXEC=y # CONFIG_CRASH_DUMP is not set CONFIG_PHYSICAL_START=0x200000 CONFIG_SECCOMP=y # CONFIG_CC_STACKPROTECTOR is not set # CONFIG_HZ_100 is not set # CONFIG_HZ_250 is not set CONFIG_HZ_1000=y CONFIG_HZ=1000 CONFIG_REORDER=y CONFIG_K8_NB=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_ISA_DMA_API=y # # Power management options # CONFIG_PM=y CONFIG_PM_LEGACY=y # CONFIG_PM_DEBUG is not set CONFIG_PM_SYSFS_DEPRECATED=y # CONFIG_SOFTWARE_SUSPEND is not set # # ACPI (Advanced Configuration and Power Interface) Support # CONFIG_ACPI=y CONFIG_ACPI_SLEEP=y CONFIG_ACPI_SLEEP_PROC_FS=y # CONFIG_ACPI_SLEEP_PROC_SLEEP is not set CONFIG_ACPI_AC=m # 
CONFIG_ACPI_BATTERY is not set CONFIG_ACPI_BUTTON=m CONFIG_ACPI_VIDEO=m CONFIG_ACPI_HOTKEY=m CONFIG_ACPI_FAN=m # CONFIG_ACPI_DOCK is not set CONFIG_ACPI_PROCESSOR=m CONFIG_ACPI_THERMAL=m # CONFIG_ACPI_ASUS is not set # CONFIG_ACPI_IBM is not set # CONFIG_ACPI_TOSHIBA is not set CONFIG_ACPI_BLACKLIST_YEAR=0 # CONFIG_ACPI_DEBUG is not set CONFIG_ACPI_EC=y CONFIG_ACPI_POWER=y CONFIG_ACPI_SYSTEM=y CONFIG_X86_PM_TIMER=y # CONFIG_ACPI_CONTAINER is not set # CONFIG_ACPI_SBS is not set # # CPU Frequency scaling # CONFIG_CPU_FREQ=y CONFIG_CPU_FREQ_TABLE=m # CONFIG_CPU_FREQ_DEBUG is not set CONFIG_CPU_FREQ_STAT=m CONFIG_CPU_FREQ_STAT_DETAILS=y # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y CONFIG_CPU_FREQ_GOV_PERFORMANCE=m CONFIG_CPU_FREQ_GOV_POWERSAVE=m CONFIG_CPU_FREQ_GOV_USERSPACE=y CONFIG_CPU_FREQ_GOV_ONDEMAND=m CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m # # CPUFreq processor drivers # CONFIG_X86_POWERNOW_K8=m CONFIG_X86_POWERNOW_K8_ACPI=y # CONFIG_X86_SPEEDSTEP_CENTRINO is not set CONFIG_X86_ACPI_CPUFREQ=m # # shared options # # CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set # CONFIG_X86_SPEEDSTEP_LIB is not set # # Bus options (PCI etc.) 
# CONFIG_PCI=y CONFIG_PCI_DIRECT=y CONFIG_PCI_MMCONFIG=y # CONFIG_PCIEPORTBUS is not set # CONFIG_PCI_MSI is not set # CONFIG_PCI_DEBUG is not set # CONFIG_HT_IRQ is not set # # PCCARD (PCMCIA/CardBus) support # # CONFIG_PCCARD is not set # # PCI Hotplug Support # CONFIG_HOTPLUG_PCI=m CONFIG_HOTPLUG_PCI_FAKE=m # CONFIG_HOTPLUG_PCI_ACPI is not set # CONFIG_HOTPLUG_PCI_CPCI is not set # CONFIG_HOTPLUG_PCI_SHPC is not set # # Executable file formats / Emulations # CONFIG_BINFMT_ELF=y CONFIG_BINFMT_MISC=m CONFIG_IA32_EMULATION=y # CONFIG_IA32_AOUT is not set CONFIG_COMPAT=y CONFIG_SYSVIPC_COMPAT=y # # Networking # CONFIG_NET=y # # Networking options # # CONFIG_NETDEBUG is not set CONFIG_PACKET=m CONFIG_PACKET_MMAP=y CONFIG_UNIX=y CONFIG_XFRM=y CONFIG_XFRM_USER=m # CONFIG_XFRM_SUB_POLICY is not set CONFIG_NET_KEY=m CONFIG_INET=y CONFIG_IP_MULTICAST=y CONFIG_IP_ADVANCED_ROUTER=y CONFIG_ASK_IP_FIB_HASH=y # CONFIG_IP_FIB_TRIE is not set CONFIG_IP_FIB_HASH=y CONFIG_IP_MULTIPLE_TABLES=y CONFIG_IP_ROUTE_FWMARK=y CONFIG_IP_ROUTE_MULTIPATH=y # CONFIG_IP_ROUTE_MULTIPATH_CACHED is not set CONFIG_IP_ROUTE_VERBOSE=y # CONFIG_IP_PNP is not set CONFIG_NET_IPIP=m CONFIG_NET_IPGRE=m # CONFIG_NET_IPGRE_BROADCAST is not set CONFIG_IP_MROUTE=y CONFIG_IP_PIMSM_V1=y CONFIG_IP_PIMSM_V2=y CONFIG_ARPD=y CONFIG_SYN_COOKIES=y CONFIG_INET_AH=m CONFIG_INET_ESP=m CONFIG_INET_IPCOMP=m CONFIG_INET_XFRM_TUNNEL=m CONFIG_INET_TUNNEL=m # CONFIG_INET_XFRM_MODE_TRANSPORT is not set # CONFIG_INET_XFRM_MODE_TUNNEL is not set # CONFIG_INET_XFRM_MODE_BEET is not set CONFIG_INET_DIAG=m CONFIG_INET_TCP_DIAG=m CONFIG_TCP_CONG_ADVANCED=y CONFIG_TCP_CONG_BIC=y CONFIG_TCP_CONG_CUBIC=m CONFIG_TCP_CONG_WESTWOOD=m CONFIG_TCP_CONG_HTCP=m # CONFIG_TCP_CONG_HSTCP is not set # CONFIG_TCP_CONG_HYBLA is not set # CONFIG_TCP_CONG_VEGAS is not set # CONFIG_TCP_CONG_SCALABLE is not set # CONFIG_TCP_CONG_LP is not set # CONFIG_TCP_CONG_VENO is not set CONFIG_DEFAULT_BIC=y # CONFIG_DEFAULT_CUBIC is not set # CONFIG_DEFAULT_HTCP 
is not set # CONFIG_DEFAULT_VEGAS is not set # CONFIG_DEFAULT_WESTWOOD is not set # CONFIG_DEFAULT_RENO is not set CONFIG_DEFAULT_TCP_CONG="bic" # # IP: Virtual Server Configuration # # CONFIG_IP_VS is not set CONFIG_IPV6=m CONFIG_IPV6_PRIVACY=y # CONFIG_IPV6_ROUTER_PREF is not set CONFIG_INET6_AH=m CONFIG_INET6_ESP=m CONFIG_INET6_IPCOMP=m # CONFIG_IPV6_MIP6 is not set CONFIG_INET6_XFRM_TUNNEL=m CONFIG_INET6_TUNNEL=m # CONFIG_INET6_XFRM_MODE_TRANSPORT is not set # CONFIG_INET6_XFRM_MODE_TUNNEL is not set # CONFIG_INET6_XFRM_MODE_BEET is not set # CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set CONFIG_IPV6_SIT=m # CONFIG_IPV6_TUNNEL is not set # CONFIG_IPV6_MULTIPLE_TABLES is not set # CONFIG_NETLABEL is not set # CONFIG_NETWORK_SECMARK is not set CONFIG_NETFILTER=y # CONFIG_NETFILTER_DEBUG is not set # # Core Netfilter Configuration # CONFIG_NETFILTER_NETLINK=m CONFIG_NETFILTER_NETLINK_QUEUE=m CONFIG_NETFILTER_NETLINK_LOG=m CONFIG_NETFILTER_XTABLES=m CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m CONFIG_NETFILTER_XT_TARGET_CONNMARK=m # CONFIG_NETFILTER_XT_TARGET_DSCP is not SCSI low-level drivers # # CONFIG_ISCSI_TCP is not set # CONFIG_BLK_DEV_3W_XXXX_RAID is not set # CONFIG_SCSI_3W_9XXX is not set # CONFIG_SCSI_ACARD is not set # CONFIG_SCSI_AACRAID is not set # CONFIG_SCSI_AIC7XXX is not set # CONFIG_SCSI_AIC7XXX_OLD is not set # CONFIG_SCSI_AIC79XX is not set # CONFIG_SCSI_AIC94XX is not set # CONFIG_SCSI_ARCMSR is not set # CONFIG_MEGARAID_NEWGEN is not set # CONFIG_MEGARAID_LEGACY is not set # CONFIG_MEGARAID_SAS is not set # CONFIG_SCSI_HPTIOP is not set # CONFIG_SCSI_BUSLOGIC is not set # CONFIG_SCSI_DMX3191D is not set # CONFIG_SCSI_EATA is not set # CONFIG_SCSI_FUTURE_DOMAIN is not set # CONFIG_SCSI_GDTH is not set # CONFIG_SCSI_IPS is not set # CONFIG_SCSI_INITIO is not set # CONFIG_SCSI_INIA100 is not set # CONFIG_SCSI_PPA is not set # CONFIG_SCSI_IMM is not set # CONFIG_SCSI_STEX is not set # CONFIG_SCSI_SYM53C8XX_2 is not set # CONFIG_SCSI_IPR is not set 
# CONFIG_SCSI_QLOGIC_1280 is not set # CONFIG_SCSI_QLA_FC is not set # CONFIG_SCSI_QLA_ISCSI is not set # CONFIG_SCSI_LPFC is not set # CONFIG_SCSI_DC395x is not set # CONFIG_SCSI_DC390T is not set # CONFIG_SCSI_DEBUG is not set # # Serial ATA (prod) and Parallel ATA (experimental) drivers # CONFIG_ATA=y # CONFIG_SATA_AHCI is not set # CONFIG_SATA_SVW is not set # CONFIG_ATA_PIIX is not set # CONFIG_SATA_MV is not set # CONFIG_SATA_NV is not set # CONFIG_PDC_ADMA is not set # CONFIG_SATA_QSTOR is not set CONFIG_SATA_PROMISE=m # CONFIG_SATA_SX4 is not set # CONFIG_SATA_SIL is not set # CONFIG_SATA_SIL24 is not set # CONFIG_SATA_SIS is not set # CONFIG_SATA_ULI is not set CONFIG_SATA_VIA=y # CONFIG_SATA_VITESSE is not set # CONFIG_PATA_ALI is not set # CONFIG_PATA_AMD is not set # CONFIG_PATA_ARTOP is not set # CONFIG_PATA_ATIIXP is not set # CONFIG_PATA_CMD64X is not set # CONFIG_PATA_CS5520 is not set # CONFIG_PATA_CS5530 is not set # CONFIG_PATA_CYPRESS is not set # CONFIG_PATA_EFAR is not set # CONFIG_ATA_GENERIC is not set # CONFIG_PATA_HPT366 is not set # CONFIG_PATA_HPT37X is not set # CONFIG_PATA_HPT3X2N is not set # C# CONFIG_NETPOLL is not set # CONFIG_NET_POLL_CONTROLLER is not set # # ISDN subsystem # CONFIG_ISDN=m # # Old ISDN4Linux # CONFIG_ISDN_I4L=m CONFIG_ISDN_PPP=y CONFIG_ISDN_PPP_VJ=y CONFIG_ISDN_MPP=y # CONFIG_IPPP_FILTER is not set CONFIG_ISDN_PPP_BSDCOMP=m CONFIG_ISDN_AUDIO=y CONFIG_ISDN_TTY_FAX=y # # ISDN feature submodules # # CONFIG_ISDN_DRV_LOOP is not set CONFIG_ISDN_DIVERSION=m # # ISDN4Linux hardware drivers # # # Passive cards # CONFIG_ISDN_DRV_HISAX=m # # D-channel protocol features # CONFIG_HISAX_EURO=y CONFIG_DE_AOC=y # CONFIG_HISAX_NO_SENDCOMPLETE is not set # CONFIG_HISAX_NO_LLC is not set # CONFIG_HISAX_NO_KEYPAD is not set # CONFIG_HISAX_1TR6 is not set # CONFIG_HISAX_NI1 is not set CONFIG_HISAX_MAX_CARDS=8 # # HiSax supported cards # # CONFIG_HISAX_16_3 is not set # CONFIG_HISAX_TELESPCI is not set # CONFIG_HISAX_S0BOX is not set 
CONFIG_HISAX_FRITZPCI=y # CONFIG_HISAX_AVM_A1_PCMCIA is not set # CONFIG_HISAX_ELSA is not set # CONFIG_HISAX_DIEHLDIVA is not set # CONFIG_HISAX_SEDLBAUER is not set # CONFIG_HISAX_NETJET is not set # CONFIG_HISAX_NETJET_U is not set # CONFIG_HISAX_NICCY is not set # CONFIG_HISAX_BKM_A4T is not set # CONFIG_HISAX_SCT_QUADRO is not set # CONFIG_HISAX_GAZEL is not set # CONFIG_HISAX_HFC_PCI is not set # CONFIG_HISAX_W6692 is not set # CONFIG_HISAX_HFC_SX is not set # CONFIG_HISAX_DEBUG is not set # # HiSax PCMCIA card service modules # # # HiSax sub driver modules # # CONFIG_HISAX_ST5481 is not set # CONFIG_HISAX_HFCUSB is not set # CONFIG_HISAX_HFC4S8S is not set CONFIG_HISAX_FRITZ_PCIPNP=m # # Active cards # # CONFIG_HYSDN is not set # # Siemens Gigaset # # CONFIG_ISDN_DRV_GIGASET is not set # # CAPI subsystem # CONFIG_ISDN_CAPI=m # CONFIG_ISDN_DRV_AVMB1_VERBOSE_REASON is not set CONFIG_ISDN_CAPI_MIDDLEWARE=y CONFIG_ISDN_CAPI_CAPI20=m CONFIG_ISDN_CAPI_CAPIFS_BOOL=y CONFIG_ISDN_CAPI_CAPIFS=m # CONFIG_ISDN_CAPI_CAPIDRV is not set # # CAPI hardware drivers # # # Active AVM cards # # CONFIG_CAPI_AVM is not set # # Active Eicon DIVA Server cards # # CONFIG_CAPI_EICON is not set # # Telephony Support # # CONFIG_PHONE is not set # # Input device support # CONFIG_INPUT=y # CONFIG_INPUT_FF_MEMLESS is not set # # Usepport # # CONFIG_SPI is not set # CONFIG_SPI_MASTER is not set # # Dallas's 1-wire bus # CONFIG_W1=m # # 1-wire Bus Masters # # CONFIG_W1_MASTER_MATROX is not set # CONFIG_W1_MASTER_DS2490 is not set # CONFIG_W1_MASTER_DS2482 is not set # # 1-wire Slaves # CONFIG_W1_SLAVE_THERM=m CONFIG_W1_SLAVE_SMEM=m CONFIG_W1_SLAVE_DS2433=m # CONFIG_W1_SLAVE_DS2433_CRC is not set # # Hardware Monitoring support # CONFIG_HWMON=m CONFIG_HWMON_VID=m # CONFIG_SENSORS_ABITUGURU is not set CONFIG_SENSORS_ADM1021=m CONFIG_SENSORS_ADM1025=m CONFIG_SENSORS_ADM1026=m CONFIG_SENSORS_ADM1031=m CONFIG_SENSORS_ADM9240=m CONFIG_SENSORS_K8TEMP=m CONFIG_SENSORS_ASB100=m # CONFIG_SENSORS_ATXP1 
is not set CONFIG_SENSORS_DS1621=m # CONFIG_SENSORS_F71805F is not set CONFIG_SENSORS_FSCHER=m CONFIG_SENSORS_FSCPOS=m CONFIG_SENSORS_GL518SM=m CONFIG_SENSORS_GL520SM=m CONFIG_SENSORS_IT87=m CONFIG_SENSORS_LM63=m CONFIG_SENSORS_LM75=m CONFIG_SENSORS_LM77=m CONFIG_SENSORS_LM78=m CONFIG_SENSORS_LM80=m CONFIG_SENSORS_LM83=m CONFIG_SENSORS_LM85=m CONFIG_SENSORS_LM87=m CONFIG_SENSORS_LM90=m CONFIG_SENSORS_LM92=m CONFIG_SENSORS_MAX1619=m CONFIG_SENSORS_PC87360=m CONFIG_SENSORS_SIS5595=m CONFIG_SENSORS_SMSC47M1=m # CONFIG_SENSORS_SMSC47M192 is not set CONFIG_SENSORS_SMSC47B397=m CONFIG_SENSORS_VIA686A=m CONFIG_SENSORS_VT1211=m CONFIG_SENSORS_VT8231=m CONFIG_SENSORS_W83781D=m # CONFIG_SENSORS_W83791D is not set CONFIG_SENSORS_W83792D=m CONFIG_SENSORS_W83L785TS=m CONFIG_SENSORS_W83627HF=m CONFIG_SENSORS_W83627EHF=m # CONFIG_SENSORS_HDAPS is not set # CONFIG_HWMON_DEBUG_CHIP is not set # # Multimedia devices # CONFIG_VIDEO_DEV=m CONFIG_VIDEO_V4L1=y CONFIG_VIDEO_V4L1_COMPAT=y CONFIG_VIDEO_V4L2=y # # Video Capture Adapters # # # Video Capture Adapters # # CONFIG_VIDEO_ADV_DEBUG is not set CONFIG_VIDEO_HELPER_CHIPS_AUTO=y CONFIG_VIDEO_TVAUDIO=m CONFIG_VIDEO_TDA7432=m CONFIG_VIDEO_TDA9875=m CONFIG_VIDEO_MSP3400=m # CONFIG_VIDEO_VIVI is not set CONFIG_VIDEO_BT848=m # CONFIG_VIDEO_BT848_DVB is not set CONFIG_VIDEO_SAA6588=m # CONFIG_VIDEO_BWQCAM is not set # CONFIG_VIDEO_CQCAM is not set # CONFIG_VIDEO_W9966 is not set # CONFIG_VIDEO_CPIA is not set # CONFIG_VIDEO_CPIA2 is not set CONFIG_VIDEO_SAA5246A=m CONFIG_VIDEO__UART=m CONFIG_SND_AC97_CODEC=m CONFIG_SND_AC97_BUS=m CONFIG_SND_DUMMY=m CONFIG_SND_VIRMIDI=m # CONFIG_SND_MTPAV is not set # CONFIG_SND_MTS64 is not set # CONFIG_SND_SERIAL_U16550 is not set CONFIG_SND_MPU401=m # # PCI devices # # CONFIG_SND_AD1889 is not set # CONFIG_SND_ALS300 is not set # CONFIG_SND_ALS4000 is not set # CONFIG_SND_ALI5451 is not set # CONFIG_SND_ATIIXP is not set # CONFIG_SND_ATIIXP_MODEM is not set # CONFIG_SND_AU8810 is not set # 
CONFIG_SND_AU8820 is not set # CONFIG_SND_AU8830 is not set # CONFIG_SND_AZT3328 is not set CONFIG_SND_BT87X=m # CONFIG_SND_BT87X_OVERCLOCK is not set # CONFIG_SND_CA0106 is not set # CONFIG_SND_CMIPCI is not set # CONFIG_SND_CS4281 is not set # CONFIG_SND_CS46XX is not set # CONFIG_SND_DARLA20 is not set # CONFIG_SND_GINA20 is not set # CONFIG_SND_LAYLA20 is not set # CONFIG_SND_DARLA24 is not set # CONFIG_SND_GINA24 is not set # CONFIG_SND_LAYLA24 is not set # CONFIG_SND_MONA is not set # CONFIG_SND_MIA is not set # CONFIG_SND_ECHO3G is not set # CONFIG_SND_INDIGO is not set # CONFIG_SND_INDIGOIO is not set # CONFIG_SND_INDIGODJ is not set # CONFIG_SND_EMU10K1 is not set # CONFIG_SND_EMU10K1X is not set # CONFIG_SND_ENS1370 is not set CONFIG_SND_ENS1371=m # CONFIG_SND_ES1938 is not set # CONFIG_SND_ES1968 is not set # CONFIG_SND_FM801 is not set # CONFIG_SND_HDA_INTEL is not set # CONFIG_SND_HDSP is not set # CONFIG_SND_HDSPM is not set # CONFIG_SND_ICE1712 is not set # CONFIG_SND_ICE1724 is not set # CONFIG_SND_INTEL8X0 is not set # CONFIG_SND_INTEL8X0M is not set # CONFIG_SND_KORG1212 is not set # CONFIG_SND_MAESTRO3 is not set # CONFIG_SND_MIXART is not set # CONFIG_SND_NM256 is not set # CONFIG_SND_PCXHR is not set # CONFIG_SND_RIPTIDE is not set # CONFIG_SND_RME32 is not set # CONFIG_SND_RME96 is not set # CONFIG_SND_RME9652 is not set # CONFIG_SND_SONICVIBES is not set # CONFIG_SND_TRIDENT is not set CONFIG_SND_VIA82XX=m # CONFIG_SND_VIA82XX_MODEM is not set # CONFIG_SND_VX222 is not set # CONFIG_SND_YMFPCI is not set CONFIG_SND_AC97_POWER_SAVE=y # # USB devices # CONFIG_SND_USB_AUDIO=m # CONFIG_SND_USB_USX2Y is not set # # Open Sound System # # CONFIG_SOUND_PRIME is not set # # USB support # CONFIG_USB_ARCH_HAS_HCD=y CONFIG_USB_ARCH_HAS_OHCI=y CONFIG_USB_ARCH_HAS_EHCI=y CONFIG_USB=m # CONFIG_USB_DEBUG is not set # # Miscellaneous USB options # CONFIG_USB_DEVICEFS=y # CONFIG_USB_BANDWIDTH is not set # CONFIG_USB_DYNAMIC_MINORS is not set # 
CONFIG_USB_SUSPEND is not set # CONFIG_USB_OTG is not set # # USB Host Controller Drivers # CONFIG_USB_EHCI_HCD=m CONFIG_USB_EHCI_SPLIT_ISO=y CONFIG_USB_EHCI_ROOT_HUB_TT=y # CONFIG_USB_EHCI_TT_NEWSCHED is not set # CONFIG_USB_ISP116X_HCD is not set CONFIG_USB_OHCI_HCD=m # CONFIG_USB_OHCI_BIG_ENDIAN is not set CONFIG_USB_OHCI_LITTLE_ENDIAN=y CONFIG_USB_UHCI_HCD=m # CONFIG_USB_SL811_HCD is not set # # USB Device Class drivers # # CONFIG_USB_ACM is not set CONFIG_USB_PRINTER=m # # NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support' # # # may also be needed; see USB_STORAGE Help for more information # CONFIG_USB_STORAGE=m # CONFIG_USB_STORAGE_DEBUG is not set # CONFIG_USB_STORAGE_DATAFAB is not set # CONFIG_USB_STORAGE_FREECOM is not set # CONFIG_USB_STORAGE_DPCM is not set # CONFIG_USB_STORAGE_USBAT is not set # CONFIG_USB_STORAGE_SDDR09 is not set # CONFIG_USB_STORAGE_SDDR55 is not set # CONFIG_USB_STORAGE_JUMPSHOT is not set # CONFIG_USB_STORAGE_ALAUDA is not set # CONFIG_USB_STORAGE_KARMA is not set # CONFIG_USB_LIBUSUAL is not set # # USB Input Devices # CONFIG_USB_HID=m CONFIG_USB_HIDINPUT=y # CONFIG_USB_HIDINPUT_POWERBOOK is not set # CONFIG_HID_FF is not set CONFIG_USB_HIDDEV=y # # USB HID Boot Protocol drivers # # CONFIG_USB_KBD is not set # CONFIG_USB_MOUSE is not set # CONFIG_USB_AIPTEK is not set # CONFIG_USB_WACOM is not set # CONFIG_USB_ACECAD is not set # CONFIG_USB_KBTAB is not set # CONFIG_USB_POWERMATE is not set # CONFIG_USB_TOUCHSCREEN is not set # CONFIG_USB_YEALINK is not set # CONFIG_USB_XPAD is not set # CONFIG_USB_ATI_REMOTE is not set # CONFIG_USB_ATI_REMOTE2 is not set # CONFIG_USB_KEYSPAN_REMOTE is not set # CONFIG_USB_APPLETOUCH is not set # # USB Imaging devices # # CONFIG_USB_MDC800 is not set # CONFIG_USB_MICROTEK is not set # # USB Network Adapters # # CONFIG_USB_CATC is not set # CONFIG_USB_KAWETH is not set # CONFIG_USB_PEGASUS is not set # CONFIG_USB_RTL8150 is not set # CONFIG_USB_USBNET_MII is not set # CONFIG_USB_USBNET is not 
set # CONFIG_USB_MON is not set # # USB port drivers # # CONFIG_USB_USS720 is not set # # USB Serial Converter support # # CONFIG_USB_SERIAL is not set # # USB Miscellaneous drivers # # CONFIG_USB_EMI62 is not set # CONFIG_USB_EMI26 is not set # CONFIG_USB_ADUTUX is not set # CONFIG_USB_AUERSWALD is not set # CONFIG_USB_RIO500 is not set # CONFIG_USB_LEGOTOWER is not set # CONFIG_USB_LCD is not set # CONFIG_USB_LED is not set # CONFIG_USB_CYPRESS_CY7C63 is not set # CONFIG_USB_CYTHERM is not set # CONFIG_USB_PHIDGET is not set # CONFIG_USB_IDMOUSE is not set # CONFIG_USB_FTDI_ELAN is not set # CONFIG_USB_APPLEDISPLAY is not set # CONFIG_USB_SISUSBVGA is not set # CONFIG_USB_LD is not set # CONFIG_USB_TRANCEVIBRATOR is not set # CONFIG_USB_TEST is not set # # USB DSL modem support # # # USB Gadget Support # # CONFIG_USB_GADGET is not set # # MMC/SD Card support # # CONFIG_MMC is not set # # LED devices # # CONFIG_NEW_LEDS is not set # # LED drivers # # # LED Triggers # # # InfiniBand support # # CONFIG_INFINIBAND is not set # # EDAC - error detection and reporting (RAS) (EXPERIMENTAL) # # CONFIG_EDAC is not set # # Real Time Clock # CONFIG_RTC_LIB=m CONFIG_RTC_CLASS=m # # RTC interfaces # CONFIG_RTC_INTF_SYSFS=m CONFIG_RTC_INTF_PROC=m CONFIG_RTC_INTF_DEV=m # CONFIG_RTC_INTF_DEV_UIE_EMUL is not set # # RTC drivers # CONFIG_RTC_DRV_X1205=m CONFIG_RTC_DRV_DS1307=m CONFIG_RTC_DRV_DS1553=m CONFIG_RTC_DRV_ISL1208=m CONFIG_RTC_DRV_DS1672=m CONFIG_RTC_DRV_DS1742=m CONFIG_RTC_DRV_PCF8563=m CONFIG_RTC_DRV_PCF8583=m CONFIG_RTC_DRV_RS5C372=m CONFIG_RTC_DRV_M48T86=m CONFIG_RTC_DRV_TEST=m CONFIG_RTC_DRV_V3020=m # # DMA Engine support # # CONFIG_DMA_ENGINE is not set # # DMA Clients # # # DMA Devices # # # Firmware Drivers # # CONFIG_EDD is not set # CONFIG_DELL_RBU is not set # CONFIG_DCDBAS is not set # # File systems # CONFIG_EXT2_FS=y # CONFIG_EXT2_FS_XATTR is not set # CONFIG_EXT2_FS_XIP is not set CONFIG_EXT3_FS=m # CONFIG_EXT3_FS_XATTR is not set # CONFIG_EXT4DEV_FS is not 
set CONFIG_JBD=m # CONFIG_JBD_DEBUG is not set CONFIG_REISER4_FS=y # CONFIG_REISER4_DEBUG is not set CONFIG_REISERFS_FS=y # CONFIG_REISERFS_CHECK is not set # CONFIG_REISERFS_PROC_INFO is not set # CONFIG_REISERFS_FS_XATTR is not set # CONFIG_JFS_FS is not set CONFIG_FS_POSIX_ACL=y # CONFIG_XFS_FS is not set # CONFIG_GFS2_FS is not set # CONFIG_OCFS2_FS is not set CONFIG_MINIX_FS=m CONFIG_ROMFS_FS=m CONFIG_INOTIFY=y CONFIG_INOTIFY_USER=y # CONFIG_QUOTA is not set CONFIG_DNOTIFY=y CONFIG_AUTOFS_FS=m CONFIG_AUTOFS4_FS=m CONFIG_FUSE_FS=m # # CD-ROM/DVD Filesystems # CONFIG_ISO9660_FS=m CONFIG_JOLIET=y CONFIG_ZISOFS=y CONFIG_ZISOFS_FS=m CONFIG_UDF_FS=m CONFIG_UDF_NLS=y # # DOS/FAT/NT Filesystems # CONFIG_FAT_FS=m CONFIG_MSDOS_FS=m CONFIG_VFAT_FS=m CONFIG_FAT_DEFAULT_CODEPAGE=850 CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-15" CONFIG_NTFS_FS=m # CONFIG_NTFS_DEBUG is not set CONFIG_NTFS_RW=y # # Pseudo filesystems # CONFIG_PROC_FS=y CONFIG_PROC_KCORE=y CONFIG_PROC_SYSCTL=y CONFIG_SYSFS=y CONFIG_TMPFS=y # CONFIG_TMPFS_POSIX_ACL is not set # CONFIG_HUGETLBFS is not set # CONFIG_HUGETLB_PAGE is not set CONFIG_RAMFS=y # CONFIG_CONFIGFS_FS is not set # # Miscellaneous filesystems # # CONFIG_ADFS_FS is not set # CONFIG_AFFS_FS is not set # CONFIG_HFS_FS is not set # CONFIG_HFSPLUS_FS is not set # CONFIG_BEFS_FS is not set # CONFIG_BFS_FS is not set # CONFIG_EFS_FS is not set CONFIG_CRAMFS=m # CONFIG_VXFS_FS is not set # CONFIG_HPFS_FS is not set # CONFIG_QNX4FS_FS is not set # CONFIG_SYSV_FS is not set # CONFIG_UFS_FS is not set # # Network File Systems # CONFIG_NFS_FS=m CONFIG_NFS_V3=y # CONFIG_NFS_V3_ACL is not set CONFIG_NFS_V4=y # CONFIG_NFS_DIRECTIO is not set CONFIG_NFSD=m CONFIG_NFSD_V3=y # CONFIG_ONFIG_CRYPTO_HASH=y CONFIG_CRYPTO_MANAGER=y CONFIG_CRYPTO_HMAC=y CONFIG_CRYPTO_NULL=m CONFIG_CRYPTO_MD4=m CONFIG_CRYPTO_MD5=y CONFIG_CRYPTO_SHA1=m CONFIG_CRYPTO_SHA256=m CONFIG_CRYPTO_SHA512=m CONFIG_CRYPTO_WP512=m CONFIG_CRYPTO_TGR192=m CONFIG_CRYPTO_ECB=m CONFIG_CRYPTO_CBC=m 
CONFIG_CRYPTO_DES=m CONFIG_CRYPTO_BLOWFISH=m CONFIG_CRYPTO_TWOFISH=m CONFIG_CRYPTO_TWOFISH_COMMON=m CONFIG_CRYPTO_TWOFISH_X86_64=m CONFIG_CRYPTO_SERPENT=m CONFIG_CRYPTO_AES=m CONFIG_CRYPTO_AES_X86_64=m CONFIG_CRYPTO_CAST5=m CONFIG_CRYPTO_CAST6=m CONFIG_CRYPTO_TEA=m CONFIG_CRYPTO_ARC4=m CONFIG_CRYPTO_KHAZAD=m CONFIG_CRYPTO_ANUBIS=m CONFIG_CRYPTO_DEFLATE=m CONFIG_CRYPTO_MICHAEL_MIC=m CONFIG_CRYPTO_CRC32C=m CONFIG_CRYPTO_TEST=m # # Hardware crypto devices # # # Library routines # CONFIG_CRC_CCITT=m CONFIG_CRC16=m CONFIG_CRC32=m CONFIG_LIBCRC32C=m CONFIG_ZLIB_INFLATE=y CONFIG_ZLIB_DEFLATE=y CONFIG_TEXTSEARCH=y CONFIG_TEXTSEARCH_KMP=m CONFIG_TEXTSEARCH_BM=m CONFIG_TEXTSEARCH_FSM=m CONFIG_PLIST=y ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-27 12:44 ` valdyn @ 2006-12-27 13:33 ` Jari Sundell 0 siblings, 0 replies; 311+ messages in thread From: Jari Sundell @ 2006-12-27 13:33 UTC (permalink / raw) To: valdyn Cc: linux-kernel, Nick Piggin, Andrei Popa, Peter Zijlstra, David S. Miller, Andrew Morton, Gordon Farquharson, Martin Michlmayr, Hugh Dickins, Arjan van de Ven, Linus Torvalds On 12/27/06, valdyn@gmail.com <valdyn@gmail.com> wrote: > I do get this error on reiserfs ( old one, didn't try on reiser4 ). > Stock 2.6.19 plus reiser4 patch. Previously reported by me only in the > debian bts. I've had reports of corrupted data on earlier kernel releases with reiserfs3, which were fixed by upgrading to reiserfs4. Jari Sundell ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-26 19:26 ` Linus Torvalds 2006-12-27 12:32 ` Jari Sundell 2006-12-27 12:44 ` valdyn @ 2007-01-07 2:06 ` Tom Lanyon 2007-01-07 5:58 ` Tom Lanyon 2007-01-07 6:05 ` Andrew Morton 2 siblings, 2 replies; 311+ messages in thread From: Tom Lanyon @ 2007-01-07 2:06 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Andrei Popa, Peter Zijlstra, David S. Miller, Andrew Morton, Gordon Farquharson, Martin Michlmayr, Hugh Dickins, Arjan van de Ven, Linux Kernel Mailing List On 12/27/06, Linus Torvalds <torvalds@osdl.org> wrote: > What would also actually be interesting is whether somebody can reproduce > this on Reiserfs, for example. I _think_ all the reports I've seen are on > ext2 or ext3, and if this is somehow writeback-related, it could be some > bug that is just shared between the two by virtue of them still having a > lot of stuff in common. > > Linus I've been following this thread for a while now as I started experiencing file corruption in rtorrent when I upgraded to 2.6.19. I am using reiserfs. -- Tom Lanyon ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2007-01-07 2:06 ` Tom Lanyon @ 2007-01-07 5:58 ` Tom Lanyon 2007-01-07 6:05 ` Andrew Morton 1 sibling, 0 replies; 311+ messages in thread From: Tom Lanyon @ 2007-01-07 5:58 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Andrei Popa, Peter Zijlstra, David S. Miller, Andrew Morton, Gordon Farquharson, Martin Michlmayr, Hugh Dickins, Arjan van de Ven, Linux Kernel Mailing List On 1/7/07, Tom Lanyon <tomlanyon@gmail.com> wrote: > I've been following this thread for a while now as I started > experiencing file corruption in rtorrent when I upgraded to 2.6.19. I > am using reiserfs. However, moving to 2.6.20-rc3 does indeed seem to fix the issue thus far... -- Tom Lanyon ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2007-01-07 2:06 ` Tom Lanyon 2007-01-07 5:58 ` Tom Lanyon @ 2007-01-07 6:05 ` Andrew Morton 1 sibling, 0 replies; 311+ messages in thread From: Andrew Morton @ 2007-01-07 6:05 UTC (permalink / raw) To: Tom Lanyon Cc: Linus Torvalds, Nick Piggin, Andrei Popa, Peter Zijlstra, David S. Miller, Gordon Farquharson, Martin Michlmayr, Hugh Dickins, Arjan van de Ven, Linux Kernel Mailing List On Sun, 7 Jan 2007 12:36:18 +1030 "Tom Lanyon" <tomlanyon@gmail.com> wrote: > On 12/27/06, Linus Torvalds <torvalds@osdl.org> wrote: > > What would also actually be interesting is whether somebody can reproduce > > this on Reiserfs, for example. I _think_ all the reports I've seen are on > > ext2 or ext3, and if this is somehow writeback-related, it could be some > > bug that is just shared between the two by virtue of them still having a > > lot of stuff in common. > > > > Linus > > I've been following this thread for a while now as I started > experiencing file corruption in rtorrent when I upgraded to 2.6.19. I > am using reiserfs. reiserfs defaults to data=ordered, so it's quite possibly the same bug. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 18:37 ` Linus Torvalds 2006-12-24 19:18 ` Linus Torvalds @ 2006-12-24 21:21 ` Michael S. Tsirkin 1 sibling, 0 replies; 311+ messages in thread From: Michael S. Tsirkin @ 2006-12-24 21:21 UTC (permalink / raw) To: Linus Torvalds Cc: Andrei Popa, Peter Zijlstra, Andrew Morton, Gordon Farquharson, Martin Michlmayr, Hugh Dickins, Nick Piggin, Arjan van de Ven, openib-general, Linux Kernel Mailing List

> Quoting Linus Torvalds <torvalds@osdl.org>:
> Subject: Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
>
> Peter, tell me I'm crazy, but with the new rules, the following condition
> is a bug:
>
>  - shared mapping
>  - writable
>  - not already marked dirty in the PTE
>
> because that combination means that the hardware can mark the PTE dirty
> without us even realizing (and thus not marking the "struct page *"
> dirty).

Er. Sorry about butting in, and I'm not sure I understand all of the discussion, but this reminded me of an old issue with COW that created what looks like a vaguely similar data corruption on infiniband. We solved this for infiniband with MADV_DONTFORK, but I always wondered why it does not affect other parts of the kernel.

Small reminder from that discussion:

 down mmap sem
 get user pages
 up mmap sem
 page becomes shared, and COW (e.g. fork)
 process writes to first byte of page <----- gets a copy

Now we had a problem: the struct page that we got from get user pages does not point to the correct page in our process. For example: if at some point we map this page for DMA, and hardware writes to the last byte of the page -----> the process does not see this data.

So for infiniband, what we do is a combination of
- prevent the page from becoming COW while hardware might DMA to this page, and
- ask users not to write to the page if hardware might DMA to the same page (even if it's using different bytes).

I just wondered - is there some chance something like this could be happening in the fs code?

HTH,

-- MST ^ permalink raw reply [flat|nested] 311+ messages in thread
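For readers unfamiliar with the MADV_DONTFORK approach mentioned in the message above, a minimal userspace sketch follows. It assumes Linux 2.6.16 or later, where madvise() accepts MADV_DONTFORK. The helper names alloc_dma_buffer() and free_dma_buffer() are hypothetical and are not taken from the InfiniBand code; the sketch only shows the call that keeps a buffer out of a child's address space, so that fork() cannot turn its pages into COW pages while hardware may still DMA into them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not from the thread): allocate a page-aligned
 * anonymous buffer and mark it MADV_DONTFORK, so that a later fork()
 * does not map it into the child and the parent's pages never become
 * copy-on-write. Returns NULL on failure.
 */
static void *alloc_dma_buffer(size_t len)
{
	void *buf;

	assert(len > 0);

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

	/* Exclude the range from any child created by fork(). */
	if (madvise(buf, len, MADV_DONTFORK) != 0) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}

/* Release a buffer obtained from alloc_dma_buffer(). */
static void free_dma_buffer(void *buf, size_t len)
{
	munmap(buf, len);
}
```

A caller would obtain the buffer with alloc_dma_buffer() before registering it with the device, and must not expect a forked child to be able to touch it.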
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 17:16 ` Linus Torvalds 2006-12-24 18:07 ` Andrew Morton 2006-12-24 18:37 ` Linus Torvalds @ 2006-12-24 19:27 ` Gordon Farquharson 2006-12-24 19:35 ` Linus Torvalds 2 siblings, 1 reply; 311+ messages in thread From: Gordon Farquharson @ 2006-12-24 19:27 UTC (permalink / raw) To: Linus Torvalds Cc: Andrei Popa, Andrew Morton, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On 12/24/06, Linus Torvalds <torvalds@osdl.org> wrote: > How about this particularly stupid diff? (please test with something that > _would_ cause corruption normally). > > It is _entirely_ untested, but what it tries to do is to simply serialize > any writeback in progress with any process that tries to re-map a shared > page into its address space and dirty it. I haven't tested it, and maybe > it misses some case, but it looks like a good way to try to avoid races > with marking pages dirty and the writeback phase .. The apt cache files (/var/cache/apt/*.bin) still get corrupted with this patch and 2.6.19. Gordon diff -Naupr linux-2.6.19.orig/fs/buffer.c linux-2.6.19/fs/buffer.c --- linux-2.6.19.orig/fs/buffer.c 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/fs/buffer.c 2006-12-21 01:16:31.000000000 -0700 @@ -2832,7 +2832,7 @@ int try_to_free_buffers(struct page *pag int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? */ @@ -2843,17 +2843,6 @@ int try_to_free_buffers(struct page *pag spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* - * If the filesystem writes its buffers by hand (eg ext3) - * then we can have clean buffers against a dirty page. 
We - * clean the page here; otherwise later reattachment of buffers - * could encounter a non-uptodate page, which is unresolvable. - * This only applies in the rare case where try_to_free_buffers - * succeeds but the page is not freed. - */ - clear_page_dirty(page); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; diff -Naupr linux-2.6.19.orig/fs/hugetlbfs/inode.c linux-2.6.19/fs/hugetlbfs/inode.c --- linux-2.6.19.orig/fs/hugetlbfs/inode.c 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/fs/hugetlbfs/inode.c 2006-12-21 01:15:21.000000000 -0700 @@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct static void truncate_huge_page(struct page *page) { - clear_page_dirty(page); + cancel_dirty_page(page, /* No IO accounting for huge pages? */0); ClearPageUptodate(page); remove_from_page_cache(page); put_page(page); diff -Naupr linux-2.6.19.orig/include/linux/page-flags.h linux-2.6.19/include/linux/page-flags.h --- linux-2.6.19.orig/include/linux/page-flags.h 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/include/linux/page-flags.h 2006-12-21 01:15:21.000000000 -0700 @@ -253,15 +253,11 @@ static inline void SetPageUptodate(struc struct page; /* forward declaration */ -int test_clear_page_dirty(struct page *page); +extern void cancel_dirty_page(struct page *page, unsigned int account_size); + int test_clear_page_writeback(struct page *page); int test_set_page_writeback(struct page *page); -static inline void clear_page_dirty(struct page *page) -{ - test_clear_page_dirty(page); -} - static inline void set_page_writeback(struct page *page) { test_set_page_writeback(page); diff -Naupr linux-2.6.19.orig/mm/memory.c linux-2.6.19/mm/memory.c --- linux-2.6.19.orig/mm/memory.c 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/mm/memory.c 2006-12-24 11:04:03.000000000 -0700 @@ -1534,6 +1534,7 @@ static int do_wp_page(struct mm_struct * if (!pte_same(*page_table, orig_pte)) goto unlock; } + wait_on_page_writeback(old_page); dirty_page = 
old_page; get_page(dirty_page); reuse = 1; @@ -1832,6 +1833,33 @@ void unmap_mapping_range(struct address_ } EXPORT_SYMBOL(unmap_mapping_range); +static void check_last_page(struct address_space *mapping, loff_t size) +{ + pgoff_t index; + unsigned int offset; + struct page *page; + + if (!mapping) + return; + offset = size & ~PAGE_MASK; + if (!offset) + return; + index = size >> PAGE_SHIFT; + page = find_lock_page(mapping, index); + if (page) { + unsigned int check = 0; + unsigned char *kaddr = kmap_atomic(page, KM_USER0); + do { + check += kaddr[offset++]; + } while (offset < PAGE_SIZE); + kunmap_atomic(kaddr,KM_USER0); + unlock_page(page); + page_cache_release(page); + if (check) + printk("%s: BADNESS: truncate check %u\n", current->comm, check); + } +} + /** * vmtruncate - unmap mappings "freed" by truncate() syscall * @inode: inode of the file used @@ -1865,6 +1893,7 @@ do_expand: goto out_sig; if (offset > inode->i_sb->s_maxbytes) goto out_big; + check_last_page(mapping, inode->i_size); i_size_write(inode, offset); out_truncate: @@ -2206,6 +2235,7 @@ retry: page_cache_release(new_page); return VM_FAULT_SIGBUS; } + wait_on_page_writeback(new_page); } } diff -Naupr linux-2.6.19.orig/mm/page-writeback.c linux-2.6.19/mm/page-writeback.c --- linux-2.6.19.orig/mm/page-writeback.c 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/mm/page-writeback.c 2006-12-21 01:26:53.000000000 -0700 @@ -843,39 +843,6 @@ int set_page_dirty_lock(struct page *pag EXPORT_SYMBOL(set_page_dirty_lock); /* - * Clear a page's dirty flag, while caring for dirty memory accounting. - * Returns true if the page was previously dirty. 
- */ -int test_clear_page_dirty(struct page *page) -{ - struct address_space *mapping = page_mapping(page); - unsigned long flags; - - if (mapping) { - write_lock_irqsave(&mapping->tree_lock, flags); - if (TestClearPageDirty(page)) { - radix_tree_tag_clear(&mapping->page_tree, - page_index(page), - PAGECACHE_TAG_DIRTY); - write_unlock_irqrestore(&mapping->tree_lock, flags); - /* - * We can continue to use `mapping' here because the - * page is locked, which pins the address_space - */ - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); - dec_zone_page_state(page, NR_FILE_DIRTY); - } - return 1; - } - write_unlock_irqrestore(&mapping->tree_lock, flags); - return 0; - } - return TestClearPageDirty(page); -} -EXPORT_SYMBOL(test_clear_page_dirty); - -/* * Clear a page's dirty flag, while caring for dirty memory accounting. * Returns true if the page was previously dirty. * diff -Naupr linux-2.6.19.orig/mm/rmap.c linux-2.6.19/mm/rmap.c --- linux-2.6.19.orig/mm/rmap.c 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/mm/rmap.c 2006-12-22 23:25:09.000000000 -0700 @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page { struct mm_struct *mm = vma->vm_mm; unsigned long address; - pte_t *pte, entry; + pte_t *pte; spinlock_t *ptl; int ret = 0; @@ -444,17 +444,18 @@ static int page_mkclean_one(struct page if (!pte) goto out; - if (!pte_dirty(*pte) && !pte_write(*pte)) - goto unlock; + if (pte_dirty(*pte) || pte_write(*pte)) { + pte_t entry; - entry = ptep_get_and_clear(mm, address, pte); - entry = pte_mkclean(entry); - entry = pte_wrprotect(entry); - ptep_establish(vma, address, pte, entry); - lazy_mmu_prot_update(entry); - ret = 1; + flush_cache_page(vma, address, pte_pfn(*pte)); + entry = ptep_clear_flush(vma, address, pte); + entry = pte_wrprotect(entry); + entry = pte_mkclean(entry); + set_pte_at(vma, address, pte, entry); + lazy_mmu_prot_update(entry); + ret = 1; + } -unlock: pte_unmap_unlock(pte, ptl); out: return ret; @@ -489,6 +490,8 @@ int 
page_mkclean(struct page *page) if (mapping) ret = page_mkclean_file(mapping, page); } + if (page_test_and_clear_dirty(page)) + ret = 1; return ret; } @@ -587,8 +590,6 @@ void page_remove_rmap(struct page *page) * Leaving it set also helps swapoff to reinstate ptes * faster for those pages still in swapcache. */ - if (page_test_and_clear_dirty(page)) - set_page_dirty(page); __dec_zone_page_state(page, PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED); } @@ -607,6 +608,7 @@ static int try_to_unmap_one(struct page pte_t pteval; spinlock_t *ptl; int ret = SWAP_AGAIN; + struct page *dirty_page = NULL; address = vma_address(page, vma); if (address == -EFAULT) @@ -633,7 +635,7 @@ static int try_to_unmap_one(struct page /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) - set_page_dirty(page); + dirty_page = page; /* Update high watermark before we lower rss */ update_hiwater_rss(mm); @@ -684,6 +686,8 @@ static int try_to_unmap_one(struct page out_unmap: pte_unmap_unlock(pte, ptl); + if (dirty_page) + set_page_dirty(dirty_page); out: return ret; } @@ -915,6 +919,9 @@ int try_to_unmap(struct page *page, int else ret = try_to_unmap_file(page, migration); + if (page_test_and_clear_dirty(page)) + set_page_dirty(page); + if (!page_mapped(page)) ret = SWAP_SUCCESS; return ret; diff -Naupr linux-2.6.19.orig/mm/truncate.c linux-2.6.19/mm/truncate.c --- linux-2.6.19.orig/mm/truncate.c 2006-11-29 14:57:37.000000000 -0700 +++ linux-2.6.19/mm/truncate.c 2006-12-23 13:21:42.000000000 -0700 @@ -50,6 +50,21 @@ static inline void truncate_partial_page do_invalidatepage(page, partial); } +void cancel_dirty_page(struct page *page, unsigned int account_size) +{ + /* If we're cancelling the page, it had better not be mapped any more */ + if (page_mapped(page)) { + static unsigned int warncount; + + WARN_ON(++warncount < 5); + } + + if (TestClearPageDirty(page) && account_size && + mapping_cap_account_dirty(page->mapping)) + dec_zone_page_state(page, 
NR_FILE_DIRTY); +} + + /* * If truncate cannot remove the fs-private metadata from the page, the page * becomes anonymous. It will be left on the LRU and may even be mapped into @@ -66,10 +81,11 @@ truncate_complete_page(struct address_sp if (page->mapping != mapping) return; + cancel_dirty_page(page, PAGE_CACHE_SIZE); + if (PagePrivate(page)) do_invalidatepage(page, 0); - clear_page_dirty(page); ClearPageUptodate(page); ClearPageMappedToDisk(page); remove_from_page_cache(page); @@ -348,7 +364,6 @@ int invalidate_inode_pages2_range(struct for (i = 0; !ret && i < pagevec_count(&pvec); i++) { struct page *page = pvec.pages[i]; pgoff_t page_index; - int was_dirty; lock_page(page); if (page->mapping != mapping) { @@ -384,12 +399,8 @@ int invalidate_inode_pages2_range(struct PAGE_CACHE_SIZE, 0); } } - was_dirty = test_clear_page_dirty(page); - if (!invalidate_complete_page2(mapping, page)) { - if (was_dirty) - set_page_dirty(page); + if (!invalidate_complete_page2(mapping, page)) ret = -EIO; - } unlock_page(page); } pagevec_release(&pvec); -- Gordon Farquharson ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 19:27 ` Gordon Farquharson @ 2006-12-24 19:35 ` Linus Torvalds 2006-12-24 20:10 ` Andrei Popa 2006-12-24 22:01 ` Martin Michlmayr 0 siblings, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-24 19:35 UTC (permalink / raw) To: Gordon Farquharson Cc: Andrei Popa, Andrew Morton, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Sun, 24 Dec 2006, Gordon Farquharson wrote: > > The apt cache files (/var/cache/apt/*.bin) still get corrupted with > this patch and 2.6.19. Yeah, if my guess about do_no_page() is right, _none_ of the previous patches should have ANY effect what-so-ever. In fact, I'd say that even the "ext3 works in writeback mode" thing that Andrei reports is probably a total fluke brought on by timing changes rather than anything else. So please try the latest patch instead (on top of anything that shows corruption reliably - the patch should be _totally_ independent of all the other issues, and I think it will apply cleanly on top of 2.6.18.3 and 2.6.19 too, so anything that shows corruption is a fine target - but try to choose something that has been the "best" at corrupting things for you, to make the testing as good as possible). Patch included here again (although I think you were cc'd on my previous email too, so you should already have it, and our emails just crossed) And if this doesn't fix it, I don't know what will.. 
Linus --- diff --git a/mm/memory.c b/mm/memory.c index 563792f..cf429c4 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2247,21 +2249,23 @@ retry: if (pte_none(*page_table)) { flush_icache_page(vma, new_page); entry = mk_pte(new_page, vma->vm_page_prot); - if (write_access) - entry = maybe_mkwrite(pte_mkdirty(entry), vma); - set_pte_at(mm, address, page_table, entry); if (anon) { inc_mm_counter(mm, anon_rss); lru_cache_add_active(new_page); page_add_new_anon_rmap(new_page, vma, address); + if (write_access) + entry = maybe_mkwrite(pte_mkdirty(entry), vma); } else { inc_mm_counter(mm, file_rss); page_add_file_rmap(new_page); + entry = pte_wrprotect(entry); if (write_access) { dirty_page = new_page; get_page(dirty_page); + entry = maybe_mkwrite(pte_mkdirty(entry), vma); } } + set_pte_at(mm, address, page_table, entry); } else { /* One of our sibling threads was faster, back out. */ page_cache_release(new_page); ^ permalink raw reply related [flat|nested] 311+ messages in thread
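The access pattern under test here (dirtying file pages through a shared, writable mapping) can be sketched in userspace as follows. This is not the test program used by anyone in the thread; the function name shared_write_check(), its arguments, and the fill pattern are assumptions made for illustration. Note also that reading back immediately is normally served from the page cache, so on its own this only demonstrates the pattern; a real reproducer would need memory pressure or a cache drop between the write and the check.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustrative only: dirty `len` bytes of `path` through a shared,
 * writable mapping, flush with msync(), then read the file back with
 * pread() and compare. Returns 0 if the contents match, 1 on a
 * mismatch, -1 if any system call fails.
 */
static int shared_write_check(const char *path, size_t len)
{
	unsigned char buf[4096];
	unsigned char *map;
	size_t off;
	int fd, ret = 0;

	assert(path != NULL);

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) != 0) {
		close(fd);
		return -1;
	}

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}

	/* Dirty every byte through the mapping with a position-dependent pattern. */
	for (off = 0; off < len; off++)
		map[off] = (unsigned char)(off * 31 + 7);

	if (msync(map, len, MS_SYNC) != 0)
		ret = -1;

	/* Compare the file contents, as seen by read(2), with the pattern. */
	for (off = 0; ret == 0 && off < len; ) {
		size_t want = len - off < sizeof(buf) ? len - off : sizeof(buf);
		ssize_t got = pread(fd, buf, want, (off_t)off);
		ssize_t i;

		if (got <= 0) {
			ret = -1;
			break;
		}
		for (i = 0; i < got; i++) {
			if (buf[i] != (unsigned char)((off + (size_t)i) * 31 + 7)) {
				ret = 1;
				break;
			}
		}
		off += (size_t)got;
	}

	munmap(map, len);
	close(fd);
	return ret;
}
```

Using a length that is not a multiple of the page size also exercises the partial last page, which is the region the check_last_page() debugging hunk earlier in the thread looks at.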
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 19:35 ` Linus Torvalds @ 2006-12-24 20:10 ` Andrei Popa 2006-12-24 20:24 ` Linus Torvalds 2006-12-24 22:01 ` Martin Michlmayr 1 sibling, 1 reply; 311+ messages in thread From: Andrei Popa @ 2006-12-24 20:10 UTC (permalink / raw) To: Linus Torvalds Cc: Gordon Farquharson, Andrew Morton, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Sun, 2006-12-24 at 11:35 -0800, Linus Torvalds wrote: > > On Sun, 24 Dec 2006, Gordon Farquharson wrote: > > > > The apt cache files (/var/cache/apt/*.bin) still get corrupted with > > this patch and 2.6.19. > > Yeah, if my guess about do_no_page() is right, _none_ of the previous > patches should have ANY effect what-so-ever. In fact, I'd say that even > the "ext3 works in writeback mode" thing that Andrei reports is probably a > total fluke brought on by timing changes rather than anything else. > > So please try the latest patch instead (on top of anything that shows > corruption reliably - the patch should be _totally_ independent of all the > other issues, and I think it will apply cleanly on top of 2.6.18.3 and > 2.6.19 too, so anything that shows corruption is a fine target - but try > to choose something that has been the "best" at corrupting things for you, > to make the testing as good as possible). > > Patch included here again (although I think you were cc'd on my previous > email too, so you should already have it, and our emails just crossed) > > And if this doesn't fix it, I don't know what will.. With latest git and patches: http://lkml.org/lkml/diff/2006/12/24/56/1 http://lkml.org/lkml/diff/2006/12/24/61/1 Hash check on download completion found bad chunks, consider using "safe_sync". 
> > Linus > > --- > diff --git a/mm/memory.c b/mm/memory.c > index 563792f..cf429c4 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2247,21 +2249,23 @@ retry: > if (pte_none(*page_table)) { > flush_icache_page(vma, new_page); > entry = mk_pte(new_page, vma->vm_page_prot); > - if (write_access) > - entry = maybe_mkwrite(pte_mkdirty(entry), vma); > - set_pte_at(mm, address, page_table, entry); > if (anon) { > inc_mm_counter(mm, anon_rss); > lru_cache_add_active(new_page); > page_add_new_anon_rmap(new_page, vma, address); > + if (write_access) > + entry = maybe_mkwrite(pte_mkdirty(entry), vma); > } else { > inc_mm_counter(mm, file_rss); > page_add_file_rmap(new_page); > + entry = pte_wrprotect(entry); > if (write_access) { > dirty_page = new_page; > get_page(dirty_page); > + entry = maybe_mkwrite(pte_mkdirty(entry), vma); > } > } > + set_pte_at(mm, address, page_table, entry); > } else { > /* One of our sibling threads was faster, back out. */ > page_cache_release(new_page); ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 20:10 ` Andrei Popa @ 2006-12-24 20:24 ` Linus Torvalds 2006-12-24 20:30 ` Andrei Popa 2006-12-26 17:51 ` Al Viro 0 siblings, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-24 20:24 UTC (permalink / raw) To: Andrei Popa Cc: Gordon Farquharson, Andrew Morton, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Sun, 24 Dec 2006, Andrei Popa wrote: > > Hash check on download completion found bad chunks, consider using > "safe_sync". Dang. Did you get any warning messages from the kernel? Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-24 20:24 ` Linus Torvalds @ 2006-12-24 20:30 ` Andrei Popa 2006-12-26 17:51 ` Al Viro 1 sibling, 0 replies; 311+ messages in thread From: Andrei Popa @ 2006-12-24 20:30 UTC (permalink / raw) To: Linus Torvalds Cc: Gordon Farquharson, Andrew Morton, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Sun, 2006-12-24 at 12:24 -0800, Linus Torvalds wrote: > > On Sun, 24 Dec 2006, Andrei Popa wrote: > > > > Hash check on download completion found bad chunks, consider using > > "safe_sync". > > Dang. Did you get any warning messages from the kernel? > Only these: ACPI: EC: evaluating _Q80 ACPI: EC: evaluating _Q80 ACPI: EC: evaluating _Q80 but I don't think that has anything to do with it... > Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 20:24 ` Linus Torvalds
  2006-12-24 20:30   ` Andrei Popa
@ 2006-12-26 17:51   ` Al Viro
  2006-12-26 17:58     ` Al Viro
  1 sibling, 1 reply; 311+ messages in thread
From: Al Viro @ 2006-12-26 17:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Gordon Farquharson, Andrew Morton, Martin Michlmayr,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On Sun, Dec 24, 2006 at 12:24:46PM -0800, Linus Torvalds wrote:
>
>
> On Sun, 24 Dec 2006, Andrei Popa wrote:
> >
> > Hash check on download completion found bad chunks, consider using
> > "safe_sync".
>
> Dang. Did you get any warning messages from the kernel?
>
> 		Linus

BTW, rmap.c patch is broken - needs at least

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/mm/rmap.c b/mm/rmap.c
index 57306fa..669acb2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -452,7 +452,7 @@ static int page_mkclean_one(struct page
 		entry = ptep_clear_flush(vma, address, pte);
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
-		set_pte_at(vma, address, pte, entry);
+		set_pte_at(mm, address, pte, entry);
 		lazy_mmu_prot_update(entry);
 		ret = 1;
 	}

^ permalink raw reply related	[flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-26 17:51 ` Al Viro
@ 2006-12-26 17:58   ` Al Viro
  0 siblings, 0 replies; 311+ messages in thread
From: Al Viro @ 2006-12-26 17:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrei Popa, Gordon Farquharson, Andrew Morton, Martin Michlmayr,
	Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On Tue, Dec 26, 2006 at 05:51:55PM +0000, Al Viro wrote:
> On Sun, Dec 24, 2006 at 12:24:46PM -0800, Linus Torvalds wrote:
> >
> >
> > On Sun, 24 Dec 2006, Andrei Popa wrote:
> > >
> > > Hash check on download completion found bad chunks, consider using
> > > "safe_sync".
> >
> > Dang. Did you get any warning messages from the kernel?
> >
> > 		Linus
>
> BTW, rmap.c patch is broken - needs at least

... but that doesn't affect most of the architectures - only sparc64 and
some of powerpc.  So it's definitely not enough.

^ permalink raw reply	[flat|nested] 311+ messages in thread
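One plausible reading of why the wrong first argument to set_pte_at() would matter only on some architectures: where the helper is a macro that never expands its first argument, passing vma instead of mm compiles and behaves identically, whereas a real function that uses mm both type-checks and dereferences it. The following is a userspace sketch of that difference only. Every name in it (toy_pte_t, toy_mm, toy_vma, set_pte_at_ignoring_mm, set_pte_at_using_mm, toy_demo) is a hypothetical stand-in, not a kernel definition, and it makes no claim about how any particular architecture actually implements set_pte_at().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel definitions. */
typedef unsigned long toy_pte_t;

struct toy_mm {
	unsigned long pte_updates;	/* bookkeeping only variant B touches */
};

struct toy_vma {
	struct toy_mm *vm_mm;		/* address space this mapping belongs to */
};

/*
 * Variant A: a macro that never expands its first argument.  Whatever is
 * passed there (an mm, a vma, anything) is dropped by the preprocessor,
 * so a wrong argument compiles cleanly and has no effect.
 */
#define set_pte_at_ignoring_mm(mm, addr, ptep, pte)	(*(ptep) = (pte))

/*
 * Variant B: a real function that uses mm.  The first argument is
 * type-checked and dereferenced, so it has to be the right object.
 */
static void set_pte_at_using_mm(struct toy_mm *mm, unsigned long addr,
				toy_pte_t *ptep, toy_pte_t pte)
{
	(void)addr;
	assert(mm != NULL);
	mm->pte_updates++;
	*ptep = pte;
}

/*
 * Store the same PTE value through both variants, deliberately handing
 * the vma to variant A.  Returns the number of updates variant B recorded
 * if both slots ended up equal, 0 otherwise.
 */
static unsigned long toy_demo(void)
{
	struct toy_mm mm = { 0 };
	struct toy_vma vma = { &mm };
	toy_pte_t slot_a = 0, slot_b = 0;

	/* Wrong type in the first slot, yet this still compiles. */
	set_pte_at_ignoring_mm(&vma, 0x1000UL, &slot_a, 0x42UL);
	/* Here the real mm is required; passing &vma would not type-check. */
	set_pte_at_using_mm(vma.vm_mm, 0x1000UL, &slot_b, 0x42UL);

	return (slot_a == slot_b) ? mm.pte_updates : 0;
}
```

Under these assumptions both variants store the same value, which is consistent with the observation that fixing the argument cannot explain corruption on architectures of the first kind.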
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24 19:35 ` Linus Torvalds
  2006-12-24 20:10   ` Andrei Popa
@ 2006-12-24 22:01   ` Martin Michlmayr
  1 sibling, 0 replies; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-24 22:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gordon Farquharson, Andrei Popa, Andrew Morton, Peter Zijlstra,
	Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

* Linus Torvalds <torvalds@osdl.org> [2006-12-24 11:35]:
> And if this doesn't fix it, I don't know what will..

Sorry, but it still fails (on top of plain 2.6.19).
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24  8:57 ` Andrew Morton
  2006-12-24  9:26   ` Linus Torvalds
  2006-12-24 12:14   ` Andrei Popa
@ 2006-12-24 14:05   ` Martin Michlmayr
  2 siblings, 0 replies; 311+ messages in thread
From: Martin Michlmayr @ 2006-12-24 14:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Gordon Farquharson, Peter Zijlstra, Andrei Popa,
	Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

* Andrew Morton <akpm@osdl.org> [2006-12-24 00:57]:
> /etc/fstab: ext2 nobh
> /etc/fstab: ext3 data=writeback,nobh

It seems that busybox mount ignores the nobh option but both ext2 and
ext3 data=writeback work for me.  This is with plain 2.6.19 which
normally always fails.
-- 
Martin Michlmayr
http://www.cyrius.com/

^ permalink raw reply	[flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
  2006-12-24  8:43 ` Linus Torvalds
  2006-12-24  8:57   ` Andrew Morton
@ 2006-12-26 16:17   ` Tobias Diedrich
  2006-12-27  4:55     ` [PATCH] mm: fix page_mkclean_one David Miller
  1 sibling, 1 reply; 311+ messages in thread
From: Tobias Diedrich @ 2006-12-26 16:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Gordon Farquharson, Martin Michlmayr, Peter Zijlstra, Andrei Popa,
	Andrew Morton, Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

Linus Torvalds wrote:

> I don't think it's a page table issue any more, it just doesn't look
> likely with the ARM UP corruption. It's also not apparently even on a
> cacheline boundary, so it probably is really a dirty bit that got cleared
> wrong due to some race with IO.

So, until now it's only been reported for SMP on i386?
I'm seeing the issue on my Pentium-M Notebook (Thinkpad R52) over here,
UP kernel, no preempt.
I first saw it with 2.6.20-rc1, but am running 2.6.20-rc2 now.
The corruption pattern looks like the one already reported: the rtorrent
hash check fails (for some files it succeeds at first, but fails after
"echo 1 > /proc/sys/vm/drop_caches"), and the corruption is zeroes at the
end of a page instead of data.
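
The pattern described above (a page that matches the expected data up to some offset and is zero from there to the end of the page) can be looked for mechanically when a known-good copy of the file is available. The following is a minimal sketch under stated assumptions: both copies are already in memory, the page size is supplied by the caller, and the helper names zero_tail_start and count_zero_tailed_pages are hypothetical and not taken from rtorrent or the kernel. Reading the suspect copy only after the page cache has been dropped, as in the report, is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Compare one page of a suspect copy (`bad`) against a known-good copy
 * (`good`).  If `bad` matches up to some offset and is all zero from that
 * offset to the end of the page, return that offset.  Return page_len if
 * the pages are identical or the damage has some other shape.
 */
static size_t zero_tail_start(const unsigned char *good,
			      const unsigned char *bad, size_t page_len)
{
	size_t first_diff = 0, i;

	if (memcmp(good, bad, page_len) == 0)
		return page_len;		/* identical page */

	while (good[first_diff] == bad[first_diff])
		first_diff++;			/* stops before page_len: pages differ */

	for (i = first_diff; i < page_len; i++)
		if (bad[i] != 0)
			return page_len;	/* differs, but not a zeroed tail */

	return first_diff;
}

/* Count the pages of a buffer whose tail was replaced by zeroes. */
static size_t count_zero_tailed_pages(const unsigned char *good,
				      const unsigned char *bad,
				      size_t len, size_t page_size)
{
	size_t off, hits = 0;

	assert(page_size > 0);
	for (off = 0; off < len; off += page_size) {
		/* The last page may be shorter than page_size. */
		size_t n = len - off < page_size ? len - off : page_size;

		if (zero_tail_start(good + off, bad + off, n) < n)
			hits++;
	}
	return hits;
}
```

A caller would typically pass the value of sysconf(_SC_PAGESIZE) as page_size; the offsets returned by zero_tail_start show whether the damage starts on any particular boundary.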
ii rtorrent 0.6.4-1 ncurses BitTorrent client based on LibTorren ii libtorrent9 0.10.4-1 a C++ BitTorrent library .config: # Automatically generated make config: don't edit # Linux kernel version: 2.6.20-rc2 # Mon Dec 25 14:00:03 2006 # CONFIG_X86_32=y CONFIG_GENERIC_TIME=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_X86=y CONFIG_MMU=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # Code maturity level options # CONFIG_EXPERIMENTAL=y CONFIG_BROKEN_ON_SMP=y CONFIG_INIT_ENV_ARG_LIMIT=32 # # General setup # CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y CONFIG_SWAP=y CONFIG_SYSVIPC=y # CONFIG_IPC_NS is not set CONFIG_POSIX_MQUEUE=y # CONFIG_BSD_PROCESS_ACCT is not set # CONFIG_TASKSTATS is not set # CONFIG_UTS_NS is not set # CONFIG_AUDIT is not set CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y # CONFIG_SYSFS_DEPRECATED is not set CONFIG_RELAY=y CONFIG_INITRAMFS_SOURCE="" CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_SYSCTL=y # CONFIG_EMBEDDED is not set CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y # CONFIG_KALLSYMS_ALL is not set # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_EPOLL=y CONFIG_SHMEM=y CONFIG_SLAB=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 # CONFIG_SLOB is not set # # Loadable module support # CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y # CONFIG_MODVERSIONS is not set # CONFIG_MODULE_SRCVERSION_ALL is not set CONFIG_KMOD=y # # Block layer # CONFIG_BLOCK=y CONFIG_LBD=y CONFIG_BLK_DEV_IO_TRACE=y # CONFIG_LSF is not set # # IO Schedulers # CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_AS=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y CONFIG_DEFAULT_AS=y # CONFIG_DEFAULT_DEADLINE is not set # 
CONFIG_DEFAULT_CFQ is not set # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED="anticipatory" # # Processor type and features # # CONFIG_SMP is not set CONFIG_X86_PC=y # CONFIG_X86_ELAN is not set # CONFIG_X86_VOYAGER is not set # CONFIG_X86_NUMAQ is not set # CONFIG_X86_SUMMIT is not set # CONFIG_X86_BIGSMP is not set # CONFIG_X86_VISWS is not set # CONFIG_X86_GENERICARCH is not set # CONFIG_X86_ES7000 is not set # CONFIG_PARAVIRT is not set # CONFIG_M386 is not set # CONFIG_M486 is not set # CONFIG_M586 is not set # CONFIG_M586TSC is not set # CONFIG_M586MMX is not set # CONFIG_M686 is not set # CONFIG_MPENTIUMII is not set # CONFIG_MPENTIUMIII is not set CONFIG_MPENTIUMM=y # CONFIG_MCORE2 is not set # CONFIG_MPENTIUM4 is not set # CONFIG_MK6 is not set # CONFIG_MK7 is not set # CONFIG_MK8 is not set # CONFIG_MCRUSOE is not set # CONFIG_MEFFICEON is not set # CONFIG_MWINCHIPC6 is not set # CONFIG_MWINCHIP2 is not set # CONFIG_MWINCHIP3D is not set # CONFIG_MGEODEGX1 is not set # CONFIG_MGEODE_LX is not set # CONFIG_MCYRIXIII is not set # CONFIG_MVIAC3_2 is not set # CONFIG_X86_GENERIC is not set CONFIG_X86_CMPXCHG=y CONFIG_X86_XADD=y CONFIG_X86_L1_CACHE_SHIFT=6 CONFIG_RWSEM_XCHGADD_ALGORITHM=y # CONFIG_ARCH_HAS_ILOG2_U32 is not set # CONFIG_ARCH_HAS_ILOG2_U64 is not set CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_X86_WP_WORKS_OK=y CONFIG_X86_INVLPG=y CONFIG_X86_BSWAP=y CONFIG_X86_POPAD_OK=y CONFIG_X86_CMPXCHG64=y CONFIG_X86_GOOD_APIC=y CONFIG_X86_INTEL_USERCOPY=y CONFIG_X86_USE_PPRO_CHECKSUM=y CONFIG_X86_TSC=y CONFIG_HPET_TIMER=y CONFIG_HPET_EMULATE_RTC=y CONFIG_PREEMPT_NONE=y # CONFIG_PREEMPT_VOLUNTARY is not set # CONFIG_PREEMPT is not set CONFIG_X86_UP_APIC=y CONFIG_X86_UP_IOAPIC=y CONFIG_X86_LOCAL_APIC=y CONFIG_X86_IO_APIC=y CONFIG_X86_MCE=y CONFIG_X86_MCE_NONFATAL=y CONFIG_X86_MCE_P4THERMAL=y CONFIG_VM86=y # CONFIG_TOSHIBA is not set # CONFIG_I8K is not set # CONFIG_X86_REBOOTFIXUPS is not set # CONFIG_MICROCODE is not set # CONFIG_X86_MSR is not set # 
CONFIG_X86_CPUID is not set # # Firmware Drivers # # CONFIG_EDD is not set # CONFIG_DELL_RBU is not set CONFIG_DCDBAS=m CONFIG_NOHIGHMEM=y # CONFIG_HIGHMEM4G is not set # CONFIG_HIGHMEM64G is not set CONFIG_PAGE_OFFSET=0xC0000000 CONFIG_ARCH_FLATMEM_ENABLE=y CONFIG_ARCH_SPARSEMEM_ENABLE=y CONFIG_ARCH_SELECT_MEMORY_MODEL=y CONFIG_ARCH_POPULATES_NODE_MAP=y CONFIG_SELECT_MEMORY_MODEL=y CONFIG_FLATMEM_MANUAL=y # CONFIG_DISCONTIGMEM_MANUAL is not set # CONFIG_SPARSEMEM_MANUAL is not set CONFIG_FLATMEM=y CONFIG_FLAT_NODE_MEM_MAP=y CONFIG_SPARSEMEM_STATIC=y CONFIG_SPLIT_PTLOCK_CPUS=4 # CONFIG_RESOURCES_64BIT is not set # CONFIG_MATH_EMULATION is not set CONFIG_MTRR=y # CONFIG_EFI is not set # CONFIG_SECCOMP is not set # CONFIG_HZ_100 is not set # CONFIG_HZ_250 is not set CONFIG_HZ_300=y # CONFIG_HZ_1000 is not set CONFIG_HZ=300 # CONFIG_KEXEC is not set # CONFIG_RELOCATABLE is not set CONFIG_PHYSICAL_ALIGN=0x100000 CONFIG_COMPAT_VDSO=y # # Power management options (ACPI, APM) # CONFIG_PM=y # CONFIG_PM_LEGACY is not set # CONFIG_PM_DEBUG is not set # CONFIG_PM_SYSFS_DEPRECATED is not set CONFIG_SOFTWARE_SUSPEND=y CONFIG_PM_STD_PARTITION="" # # ACPI (Advanced Configuration and Power Interface) Support # CONFIG_ACPI=y CONFIG_ACPI_SLEEP=y CONFIG_ACPI_SLEEP_PROC_FS=y # CONFIG_ACPI_SLEEP_PROC_SLEEP is not set CONFIG_ACPI_AC=y CONFIG_ACPI_BATTERY=y CONFIG_ACPI_BUTTON=y CONFIG_ACPI_VIDEO=y CONFIG_ACPI_HOTKEY=m CONFIG_ACPI_FAN=y CONFIG_ACPI_DOCK=y CONFIG_ACPI_PROCESSOR=y CONFIG_ACPI_THERMAL=y # CONFIG_ACPI_ASUS is not set CONFIG_ACPI_IBM=m # CONFIG_ACPI_TOSHIBA is not set # CONFIG_ACPI_CUSTOM_DSDT is not set CONFIG_ACPI_BLACKLIST_YEAR=0 # CONFIG_ACPI_DEBUG is not set CONFIG_ACPI_EC=y CONFIG_ACPI_POWER=y CONFIG_ACPI_SYSTEM=y CONFIG_X86_PM_TIMER=y # CONFIG_ACPI_CONTAINER is not set # CONFIG_ACPI_SBS is not set # # APM (Advanced Power Management) BIOS Support # # CONFIG_APM is not set # # CPU Frequency scaling # CONFIG_CPU_FREQ=y CONFIG_CPU_FREQ_TABLE=y # CONFIG_CPU_FREQ_DEBUG is not 
set CONFIG_CPU_FREQ_STAT=y # CONFIG_CPU_FREQ_STAT_DETAILS is not set CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set CONFIG_CPU_FREQ_GOV_PERFORMANCE=y CONFIG_CPU_FREQ_GOV_POWERSAVE=y CONFIG_CPU_FREQ_GOV_USERSPACE=y CONFIG_CPU_FREQ_GOV_ONDEMAND=y CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y # # CPUFreq processor drivers # # CONFIG_X86_ACPI_CPUFREQ is not set # CONFIG_X86_POWERNOW_K6 is not set # CONFIG_X86_POWERNOW_K7 is not set # CONFIG_X86_POWERNOW_K8 is not set # CONFIG_X86_GX_SUSPMOD is not set CONFIG_X86_SPEEDSTEP_CENTRINO=y CONFIG_X86_SPEEDSTEP_CENTRINO_ACPI=y CONFIG_X86_SPEEDSTEP_CENTRINO_TABLE=y CONFIG_X86_SPEEDSTEP_ICH=y CONFIG_X86_SPEEDSTEP_SMI=y # CONFIG_X86_P4_CLOCKMOD is not set # CONFIG_X86_CPUFREQ_NFORCE2 is not set # CONFIG_X86_LONGRUN is not set # CONFIG_X86_LONGHAUL is not set # # shared options # # CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set CONFIG_X86_SPEEDSTEP_LIB=y # CONFIG_X86_SPEEDSTEP_RELAXED_CAP_CHECK is not set # # Bus options (PCI, PCMCIA, EISA, MCA, ISA) # CONFIG_PCI=y # CONFIG_PCI_GOBIOS is not set # CONFIG_PCI_GOMMCONFIG is not set # CONFIG_PCI_GODIRECT is not set CONFIG_PCI_GOANY=y CONFIG_PCI_BIOS=y CONFIG_PCI_DIRECT=y CONFIG_PCI_MMCONFIG=y CONFIG_PCIEPORTBUS=y # CONFIG_HOTPLUG_PCI_PCIE is not set CONFIG_PCIEAER=y CONFIG_PCI_MSI=y # CONFIG_PCI_MULTITHREAD_PROBE is not set # CONFIG_PCI_DEBUG is not set CONFIG_HT_IRQ=y CONFIG_ISA_DMA_API=y CONFIG_ISA=y # CONFIG_EISA is not set # CONFIG_MCA is not set # CONFIG_SCx200 is not set # # PCCARD (PCMCIA/CardBus) support # CONFIG_PCCARD=y # CONFIG_PCMCIA_DEBUG is not set CONFIG_PCMCIA=y CONFIG_PCMCIA_LOAD_CIS=y CONFIG_PCMCIA_IOCTL=y CONFIG_CARDBUS=y # # PC-card bridges # CONFIG_YENTA=y CONFIG_YENTA_O2=y CONFIG_YENTA_RICOH=y CONFIG_YENTA_TI=y CONFIG_YENTA_ENE_TUNE=y CONFIG_YENTA_TOSHIBA=y # CONFIG_PD6729 is not set # CONFIG_I82092 is not set # CONFIG_I82365 is not set # CONFIG_TCIC is not set CONFIG_PCMCIA_PROBE=y CONFIG_PCCARD_NONSTATIC=y # # PCI Hotplug 
Support # CONFIG_HOTPLUG_PCI=y # CONFIG_HOTPLUG_PCI_FAKE is not set # CONFIG_HOTPLUG_PCI_COMPAQ is not set CONFIG_HOTPLUG_PCI_IBM=y CONFIG_HOTPLUG_PCI_ACPI=y CONFIG_HOTPLUG_PCI_ACPI_IBM=y # CONFIG_HOTPLUG_PCI_CPCI is not set # CONFIG_HOTPLUG_PCI_SHPC is not set # # Executable file formats # CONFIG_BINFMT_ELF=y CONFIG_BINFMT_AOUT=y CONFIG_BINFMT_MISC=y # # Networking # CONFIG_NET=y # # Networking options # # CONFIG_NETDEBUG is not set CONFIG_PACKET=y CONFIG_PACKET_MMAP=y CONFIG_UNIX=y CONFIG_XFRM=y # CONFIG_XFRM_USER is not set # CONFIG_XFRM_SUB_POLICY is not set # CONFIG_NET_KEY is not set CONFIG_INET=y CONFIG_IP_MULTICAST=y # CONFIG_IP_ADVANCED_ROUTER is not set CONFIG_IP_FIB_HASH=y # CONFIG_IP_PNP is not set # CONFIG_NET_IPIP is not set # CONFIG_NET_IPGRE is not set # CONFIG_IP_MROUTE is not set # CONFIG_ARPD is not set CONFIG_SYN_COOKIES=y # CONFIG_INET_AH is not set # CONFIG_INET_ESP is not set # CONFIG_INET_IPCOMP is not set # CONFIG_INET_XFRM_TUNNEL is not set # CONFIG_INET_TUNNEL is not set CONFIG_INET_XFRM_MODE_TRANSPORT=y CONFIG_INET_XFRM_MODE_TUNNEL=y CONFIG_INET_XFRM_MODE_BEET=y CONFIG_INET_DIAG=y CONFIG_INET_TCP_DIAG=y CONFIG_TCP_CONG_ADVANCED=y CONFIG_TCP_CONG_BIC=y CONFIG_TCP_CONG_CUBIC=y CONFIG_TCP_CONG_WESTWOOD=y # CONFIG_TCP_CONG_HTCP is not set CONFIG_TCP_CONG_HSTCP=y # CONFIG_TCP_CONG_HYBLA is not set CONFIG_TCP_CONG_VEGAS=y # CONFIG_TCP_CONG_SCALABLE is not set # CONFIG_TCP_CONG_LP is not set # CONFIG_TCP_CONG_VENO is not set # CONFIG_DEFAULT_BIC is not set CONFIG_DEFAULT_CUBIC=y # CONFIG_DEFAULT_HTCP is not set # CONFIG_DEFAULT_VEGAS is not set # CONFIG_DEFAULT_WESTWOOD is not set # CONFIG_DEFAULT_RENO is not set CONFIG_DEFAULT_TCP_CONG="cubic" # CONFIG_TCP_MD5SIG is not set # # IP: Virtual Server Configuration # # CONFIG_IP_VS is not set CONFIG_IPV6=y # CONFIG_IPV6_PRIVACY is not set CONFIG_IPV6_ROUTER_PREF=y CONFIG_IPV6_ROUTE_INFO=y # CONFIG_INET6_AH is not set # CONFIG_INET6_ESP is not set # CONFIG_INET6_IPCOMP is not set # CONFIG_IPV6_MIP6 
is not set # CONFIG_INET6_XFRM_TUNNEL is not set CONFIG_INET6_TUNNEL=y CONFIG_INET6_XFRM_MODE_TRANSPORT=y CONFIG_INET6_XFRM_MODE_TUNNEL=y CONFIG_INET6_XFRM_MODE_BEET=y # CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set CONFIG_IPV6_SIT=y CONFIG_IPV6_TUNNEL=y # CONFIG_IPV6_MULTIPLE_TABLES is not set # CONFIG_NETWORK_SECMARK is not set CONFIG_NETFILTER=y # CONFIG_NETFILTER_DEBUG is not set CONFIG_BRIDGE_NETFILTER=y # # Core Netfilter Configuration # CONFIG_NETFILTER_NETLINK=y CONFIG_NETFILTER_NETLINK_QUEUE=y CONFIG_NETFILTER_NETLINK_LOG=y # CONFIG_NF_CONNTRACK_ENABLED is not set CONFIG_NETFILTER_XTABLES=y CONFIG_NETFILTER_XT_TARGET_CLASSIFY=y # CONFIG_NETFILTER_XT_TARGET_DSCP is not set CONFIG_NETFILTER_XT_TARGET_MARK=y CONFIG_NETFILTER_XT_TARGET_NFQUEUE=y # CONFIG_NETFILTER_XT_TARGET_NFLOG is not set CONFIG_NETFILTER_XT_MATCH_COMMENT=y # CONFIG_NETFILTER_XT_MATCH_DCCP is not set # CONFIG_NETFILTER_XT_MATCH_DSCP is not set # CONFIG_NETFILTER_XT_MATCH_ESP is not set # CONFIG_NETFILTER_XT_MATCH_LENGTH is not set CONFIG_NETFILTER_XT_MATCH_LIMIT=y CONFIG_NETFILTER_XT_MATCH_MAC=y CONFIG_NETFILTER_XT_MATCH_MARK=y # CONFIG_NETFILTER_XT_MATCH_POLICY is not set CONFIG_NETFILTER_XT_MATCH_MULTIPORT=y CONFIG_NETFILTER_XT_MATCH_PHYSDEV=y CONFIG_NETFILTER_XT_MATCH_PKTTYPE=y # CONFIG_NETFILTER_XT_MATCH_QUOTA is not set CONFIG_NETFILTER_XT_MATCH_REALM=y # CONFIG_NETFILTER_XT_MATCH_SCTP is not set # CONFIG_NETFILTER_XT_MATCH_STATISTIC is not set # CONFIG_NETFILTER_XT_MATCH_STRING is not set CONFIG_NETFILTER_XT_MATCH_TCPMSS=y # CONFIG_NETFILTER_XT_MATCH_HASHLIMIT is not set # # IP: Netfilter Configuration # CONFIG_IP_NF_QUEUE=y CONFIG_IP_NF_IPTABLES=y CONFIG_IP_NF_MATCH_IPRANGE=y CONFIG_IP_NF_MATCH_TOS=y # CONFIG_IP_NF_MATCH_RECENT is not set CONFIG_IP_NF_MATCH_ECN=y CONFIG_IP_NF_MATCH_AH=y # CONFIG_IP_NF_MATCH_TTL is not set CONFIG_IP_NF_MATCH_OWNER=y CONFIG_IP_NF_MATCH_ADDRTYPE=y CONFIG_IP_NF_FILTER=y CONFIG_IP_NF_TARGET_REJECT=y CONFIG_IP_NF_TARGET_LOG=y # 
CONFIG_IP_NF_TARGET_ULOG is not set CONFIG_IP_NF_TARGET_TCPMSS=y CONFIG_IP_NF_MANGLE=y CONFIG_IP_NF_TARGET_TOS=y CONFIG_IP_NF_TARGET_ECN=y # CONFIG_IP_NF_TARGET_TTL is not set # CONFIG_IP_NF_RAW is not set # CONFIG_IP_NF_ARPTABLES is not set # # IPv6: Netfilter Configuration (EXPERIMENTAL) # CONFIG_IP6_NF_QUEUE=y # CONFIG_IP6_NF_IPTABLES is not set # # Bridge: Netfilter Configuration # # CONFIG_BRIDGE_NF_EBTABLES is not set # # DCCP Configuration (EXPERIMENTAL) # # CONFIG_IP_DCCP is not set # # SCTP Configuration (EXPERIMENTAL) # # CONFIG_IP_SCTP is not set # # TIPC Configuration (EXPERIMENTAL) # # CONFIG_TIPC is not set # CONFIG_ATM is not set CONFIG_BRIDGE=y CONFIG_VLAN_8021Q=y # CONFIG_DECNET is not set CONFIG_LLC=y # CONFIG_LLC2 is not set # CONFIG_IPX is not set # CONFIG_ATALK is not set # CONFIG_X25 is not set # CONFIG_LAPB is not set # CONFIG_ECONET is not set # CONFIG_WAN_ROUTER is not set # # QoS and/or fair queueing # CONFIG_NET_SCHED=y CONFIG_NET_SCH_FIFO=y # CONFIG_NET_SCH_CLK_JIFFIES is not set # CONFIG_NET_SCH_CLK_GETTIMEOFDAY is not set CONFIG_NET_SCH_CLK_CPU=y # # Queueing/Scheduling # CONFIG_NET_SCH_CBQ=y CONFIG_NET_SCH_HTB=y # CONFIG_NET_SCH_HFSC is not set CONFIG_NET_SCH_PRIO=y CONFIG_NET_SCH_RED=y CONFIG_NET_SCH_SFQ=y # CONFIG_NET_SCH_TEQL is not set CONFIG_NET_SCH_TBF=y CONFIG_NET_SCH_GRED=y CONFIG_NET_SCH_DSMARK=y CONFIG_NET_SCH_NETEM=y CONFIG_NET_SCH_INGRESS=y # # Classification # CONFIG_NET_CLS=y CONFIG_NET_CLS_BASIC=y CONFIG_NET_CLS_TCINDEX=y CONFIG_NET_CLS_ROUTE4=y CONFIG_NET_CLS_ROUTE=y # CONFIG_NET_CLS_FW is not set CONFIG_NET_CLS_U32=y # CONFIG_CLS_U32_PERF is not set # CONFIG_CLS_U32_MARK is not set # CONFIG_NET_CLS_RSVP is not set # CONFIG_NET_CLS_RSVP6 is not set # CONFIG_NET_EMATCH is not set # CONFIG_NET_CLS_ACT is not set # CONFIG_NET_CLS_POLICE is not set # CONFIG_NET_CLS_IND is not set # CONFIG_NET_ESTIMATOR is not set # # Network testing # # CONFIG_NET_PKTGEN is not set # CONFIG_HAMRADIO is not set # CONFIG_IRDA is not set 
CONFIG_BT=y CONFIG_BT_L2CAP=y CONFIG_BT_SCO=y CONFIG_BT_RFCOMM=y CONFIG_BT_RFCOMM_TTY=y CONFIG_BT_BNEP=y CONFIG_BT_BNEP_MC_FILTER=y CONFIG_BT_BNEP_PROTO_FILTER=y CONFIG_BT_HIDP=y # # Bluetooth device drivers # CONFIG_BT_HCIUSB=m CONFIG_BT_HCIUSB_SCO=y # CONFIG_BT_HCIUART is not set # CONFIG_BT_HCIBCM203X is not set # CONFIG_BT_HCIBPA10X is not set # CONFIG_BT_HCIBFUSB is not set # CONFIG_BT_HCIDTL1 is not set # CONFIG_BT_HCIBT3C is not set # CONFIG_BT_HCIBLUECARD is not set # CONFIG_BT_HCIBTUART is not set # CONFIG_BT_HCIVHCI is not set CONFIG_IEEE80211=y # CONFIG_IEEE80211_DEBUG is not set CONFIG_IEEE80211_CRYPT_WEP=y CONFIG_IEEE80211_CRYPT_CCMP=y CONFIG_IEEE80211_CRYPT_TKIP=y CONFIG_IEEE80211_SOFTMAC=y # CONFIG_IEEE80211_SOFTMAC_DEBUG is not set CONFIG_WIRELESS_EXT=y # # Device Drivers # # # Generic Driver Options # # CONFIG_STANDALONE is not set # CONFIG_PREVENT_FIRMWARE_BUILD is not set CONFIG_FW_LOADER=y # CONFIG_DEBUG_DRIVER is not set # CONFIG_SYS_HYPERVISOR is not set # # Connector - unified userspace <-> kernelspace linker # CONFIG_CONNECTOR=y # CONFIG_PROC_EVENTS is not set # # Memory Technology Devices (MTD) # CONFIG_MTD=m # CONFIG_MTD_DEBUG is not set # CONFIG_MTD_CONCAT is not set CONFIG_MTD_PARTITIONS=y # CONFIG_MTD_REDBOOT_PARTS is not set # # User Modules And Translation Layers # CONFIG_MTD_CHAR=m CONFIG_MTD_BLOCK=m # CONFIG_MTD_BLOCK_RO is not set CONFIG_FTL=m CONFIG_NFTL=m # CONFIG_NFTL_RW is not set CONFIG_INFTL=m CONFIG_RFD_FTL=m # CONFIG_SSFDC is not set # # RAM/ROM/Flash chip drivers # CONFIG_MTD_CFI=m CONFIG_MTD_JEDECPROBE=m CONFIG_MTD_GEN_PROBE=m # CONFIG_MTD_CFI_ADV_OPTIONS is not set CONFIG_MTD_MAP_BANK_WIDTH_1=y CONFIG_MTD_MAP_BANK_WIDTH_2=y CONFIG_MTD_MAP_BANK_WIDTH_4=y # CONFIG_MTD_MAP_BANK_WIDTH_8 is not set # CONFIG_MTD_MAP_BANK_WIDTH_16 is not set # CONFIG_MTD_MAP_BANK_WIDTH_32 is not set CONFIG_MTD_CFI_I1=y CONFIG_MTD_CFI_I2=y # CONFIG_MTD_CFI_I4 is not set # CONFIG_MTD_CFI_I8 is not set CONFIG_MTD_CFI_INTELEXT=m 
CONFIG_MTD_CFI_AMDSTD=m CONFIG_MTD_CFI_STAA=m CONFIG_MTD_CFI_UTIL=m CONFIG_MTD_RAM=m CONFIG_MTD_ROM=m # CONFIG_MTD_ABSENT is not set # CONFIG_MTD_OBSOLETE_CHIPS is not set # # Mapping drivers for chip access # CONFIG_MTD_COMPLEX_MAPPINGS=y # CONFIG_MTD_PHYSMAP is not set # CONFIG_MTD_PNC2000 is not set # CONFIG_MTD_NETSC520 is not set # CONFIG_MTD_TS5500 is not set # CONFIG_MTD_SBC_GXX is not set # CONFIG_MTD_AMD76XROM is not set # CONFIG_MTD_ICHXROM is not set # CONFIG_MTD_SCB2_FLASH is not set # CONFIG_MTD_NETtel is not set # CONFIG_MTD_L440GX is not set # CONFIG_MTD_PCI is not set # CONFIG_MTD_PLATRAM is not set # # Self-contained MTD device drivers # # CONFIG_MTD_PMC551 is not set # CONFIG_MTD_SLRAM is not set # CONFIG_MTD_PHRAM is not set # CONFIG_MTD_MTDRAM is not set CONFIG_MTD_BLOCK2MTD=m # # Disk-On-Chip Device Drivers # # CONFIG_MTD_DOC2000 is not set # CONFIG_MTD_DOC2001 is not set # CONFIG_MTD_DOC2001PLUS is not set # # NAND Flash Device Drivers # CONFIG_MTD_NAND=m # CONFIG_MTD_NAND_VERIFY_WRITE is not set # CONFIG_MTD_NAND_ECC_SMC is not set CONFIG_MTD_NAND_IDS=m # CONFIG_MTD_NAND_DISKONCHIP is not set # CONFIG_MTD_NAND_CS553X is not set # CONFIG_MTD_NAND_NANDSIM is not set # # OneNAND Flash Device Drivers # # CONFIG_MTD_ONENAND is not set # # Parallel port support # CONFIG_PARPORT=y CONFIG_PARPORT_PC=y # CONFIG_PARPORT_SERIAL is not set CONFIG_PARPORT_PC_FIFO=y # CONFIG_PARPORT_PC_SUPERIO is not set # CONFIG_PARPORT_PC_PCMCIA is not set # CONFIG_PARPORT_GSC is not set # CONFIG_PARPORT_AX88796 is not set # CONFIG_PARPORT_1284 is not set # # Plug and Play support # CONFIG_PNP=y # CONFIG_PNP_DEBUG is not set # # Protocols # # CONFIG_ISAPNP is not set # CONFIG_PNPBIOS is not set CONFIG_PNPACPI=y # # Block devices # # CONFIG_BLK_DEV_FD is not set # CONFIG_BLK_DEV_XD is not set # CONFIG_PARIDE is not set # CONFIG_BLK_CPQ_DA is not set # CONFIG_BLK_CPQ_CISS_DA is not set # CONFIG_BLK_DEV_DAC960 is not set # CONFIG_BLK_DEV_UMEM is not set # 
CONFIG_BLK_DEV_COW_COMMON is not set CONFIG_BLK_DEV_LOOP=y # CONFIG_BLK_DEV_CRYPTOLOOP is not set CONFIG_BLK_DEV_NBD=y # CONFIG_BLK_DEV_SX8 is not set # CONFIG_BLK_DEV_UB is not set # CONFIG_BLK_DEV_RAM is not set # CONFIG_BLK_DEV_INITRD is not set # CONFIG_CDROM_PKTCDVD is not set # CONFIG_ATA_OVER_ETH is not set # # Misc devices # # CONFIG_IBM_ASM is not set # CONFIG_SGI_IOC4 is not set # CONFIG_TIFM_CORE is not set # CONFIG_MSI_LAPTOP is not set # # ATA/ATAPI/MFM/RLL support # # CONFIG_IDE is not set # # SCSI device support # # CONFIG_RAID_ATTRS is not set CONFIG_SCSI=y # CONFIG_SCSI_TGT is not set # CONFIG_SCSI_NETLINK is not set CONFIG_SCSI_PROC_FS=y # # SCSI support type (disk, tape, CD-ROM) # CONFIG_BLK_DEV_SD=y # CONFIG_CHR_DEV_ST is not set # CONFIG_CHR_DEV_OSST is not set CONFIG_BLK_DEV_SR=y # CONFIG_BLK_DEV_SR_VENDOR is not set CONFIG_CHR_DEV_SG=y # CONFIG_CHR_DEV_SCH is not set # # Some SCSI devices (e.g. CD jukebox) support multiple LUNs # CONFIG_SCSI_MULTI_LUN=y CONFIG_SCSI_CONSTANTS=y # CONFIG_SCSI_LOGGING is not set CONFIG_SCSI_SCAN_ASYNC=y # # SCSI Transports # # CONFIG_SCSI_SPI_ATTRS is not set # CONFIG_SCSI_FC_ATTRS is not set # CONFIG_SCSI_ISCSI_ATTRS is not set # CONFIG_SCSI_SAS_ATTRS is not set # CONFIG_SCSI_SAS_LIBSAS is not set # # SCSI low-level drivers # # CONFIG_ISCSI_TCP is not set # CONFIG_BLK_DEV_3W_XXXX_RAID is not set # CONFIG_SCSI_3W_9XXX is not set # CONFIG_SCSI_7000FASST is not set # CONFIG_SCSI_ACARD is not set # CONFIG_SCSI_AHA152X is not set # CONFIG_SCSI_AHA1542 is not set # CONFIG_SCSI_AACRAID is not set # CONFIG_SCSI_AIC7XXX is not set # CONFIG_SCSI_AIC7XXX_OLD is not set # CONFIG_SCSI_AIC79XX is not set # CONFIG_SCSI_AIC94XX is not set # CONFIG_SCSI_DPT_I2O is not set # CONFIG_SCSI_ADVANSYS is not set # CONFIG_SCSI_IN2000 is not set # CONFIG_SCSI_ARCMSR is not set # CONFIG_MEGARAID_NEWGEN is not set # CONFIG_MEGARAID_LEGACY is not set # CONFIG_MEGARAID_SAS is not set # CONFIG_SCSI_HPTIOP is not set # CONFIG_SCSI_BUSLOGIC is 
not set # CONFIG_SCSI_DMX3191D is not set # CONFIG_SCSI_DTC3280 is not set # CONFIG_SCSI_EATA is not set # CONFIG_SCSI_FUTURE_DOMAIN is not set # CONFIG_SCSI_GDTH is not set # CONFIG_SCSI_GENERIC_NCR5380 is not set # CONFIG_SCSI_GENERIC_NCR5380_MMIO is not set # CONFIG_SCSI_IPS is not set # CONFIG_SCSI_INITIO is not set # CONFIG_SCSI_INIA100 is not set # CONFIG_SCSI_PPA is not set # CONFIG_SCSI_IMM is not set # CONFIG_SCSI_NCR53C406A is not set # CONFIG_SCSI_STEX is not set # CONFIG_SCSI_SYM53C8XX_2 is not set # CONFIG_SCSI_IPR is not set # CONFIG_SCSI_PAS16 is not set # CONFIG_SCSI_PSI240I is not set # CONFIG_SCSI_QLOGIC_FAS is not set # CONFIG_SCSI_QLOGIC_1280 is not set # CONFIG_SCSI_QLA_FC is not set # CONFIG_SCSI_QLA_ISCSI is not set # CONFIG_SCSI_LPFC is not set # CONFIG_SCSI_SYM53C416 is not set # CONFIG_SCSI_DC395x is not set # CONFIG_SCSI_DC390T is not set # CONFIG_SCSI_T128 is not set # CONFIG_SCSI_U14_34F is not set # CONFIG_SCSI_ULTRASTOR is not set # CONFIG_SCSI_NSP32 is not set # CONFIG_SCSI_DEBUG is not set # CONFIG_SCSI_SRP is not set # # PCMCIA SCSI adapter support # # CONFIG_PCMCIA_AHA152X is not set # CONFIG_PCMCIA_FDOMAIN is not set # CONFIG_PCMCIA_NINJA_SCSI is not set # CONFIG_PCMCIA_QLOGIC is not set # CONFIG_PCMCIA_SYM53C500 is not set # # Serial ATA (prod) and Parallel ATA (experimental) drivers # CONFIG_ATA=y CONFIG_SATA_AHCI=y # CONFIG_SATA_SVW is not set CONFIG_ATA_PIIX=y # CONFIG_SATA_MV is not set # CONFIG_SATA_NV is not set # CONFIG_PDC_ADMA is not set # CONFIG_SATA_QSTOR is not set # CONFIG_SATA_PROMISE is not set # CONFIG_SATA_SX4 is not set # CONFIG_SATA_SIL is not set # CONFIG_SATA_SIL24 is not set # CONFIG_SATA_SIS is not set # CONFIG_SATA_ULI is not set # CONFIG_SATA_VIA is not set # CONFIG_SATA_VITESSE is not set # CONFIG_PATA_ALI is not set # CONFIG_PATA_AMD is not set # CONFIG_PATA_ARTOP is not set # CONFIG_PATA_ATIIXP is not set # CONFIG_PATA_CMD64X is not set # CONFIG_PATA_CS5520 is not set # CONFIG_PATA_CS5530 is not set # 
CONFIG_PATA_CS5535 is not set # CONFIG_PATA_CYPRESS is not set # CONFIG_PATA_EFAR is not set # CONFIG_ATA_GENERIC is not set # CONFIG_PATA_HPT366 is not set # CONFIG_PATA_HPT37X is not set # CONFIG_PATA_HPT3X2N is not set # CONFIG_PATA_HPT3X3 is not set # CONFIG_PATA_IT821X is not set # CONFIG_PATA_JMICRON is not set # CONFIG_PATA_LEGACY is not set # CONFIG_PATA_TRIFLEX is not set # CONFIG_PATA_MARVELL is not set # CONFIG_PATA_MPIIX is not set # CONFIG_PATA_OLDPIIX is not set # CONFIG_PATA_NETCELL is not set # CONFIG_PATA_NS87410 is not set # CONFIG_PATA_OPTI is not set # CONFIG_PATA_OPTIDMA is not set # CONFIG_PATA_PCMCIA is not set # CONFIG_PATA_PDC_OLD is not set # CONFIG_PATA_QDI is not set # CONFIG_PATA_RADISYS is not set # CONFIG_PATA_RZ1000 is not set # CONFIG_PATA_SC1200 is not set # CONFIG_PATA_SERVERWORKS is not set # CONFIG_PATA_PDC2027X is not set # CONFIG_PATA_SIL680 is not set # CONFIG_PATA_SIS is not set # CONFIG_PATA_VIA is not set # CONFIG_PATA_WINBOND is not set # CONFIG_PATA_WINBOND_VLB is not set # # Old CD-ROM drivers (not SCSI, not IDE) # # CONFIG_CD_NO_IDESCSI is not set # # Multi-device support (RAID and LVM) # CONFIG_MD=y # CONFIG_BLK_DEV_MD is not set CONFIG_BLK_DEV_DM=y # CONFIG_DM_DEBUG is not set CONFIG_DM_CRYPT=y CONFIG_DM_SNAPSHOT=y # CONFIG_DM_MIRROR is not set # CONFIG_DM_ZERO is not set # CONFIG_DM_MULTIPATH is not set # # Fusion MPT device support # # CONFIG_FUSION is not set # CONFIG_FUSION_SPI is not set # CONFIG_FUSION_FC is not set # CONFIG_FUSION_SAS is not set # # IEEE 1394 (FireWire) support # CONFIG_IEEE1394=y # # Subsystem Options # # CONFIG_IEEE1394_VERBOSEDEBUG is not set # CONFIG_IEEE1394_OUI_DB is not set CONFIG_IEEE1394_EXTRA_CONFIG_ROMS=y CONFIG_IEEE1394_CONFIG_ROM_IP1394=y # CONFIG_IEEE1394_EXPORT_FULL_API is not set # # Device Drivers # # CONFIG_IEEE1394_PCILYNX is not set CONFIG_IEEE1394_OHCI1394=m # # Protocol Drivers # # CONFIG_IEEE1394_VIDEO1394 is not set CONFIG_IEEE1394_SBP2=y # CONFIG_IEEE1394_SBP2_PHYS_DMA 
is not set CONFIG_IEEE1394_ETH1394=y # CONFIG_IEEE1394_DV1394 is not set CONFIG_IEEE1394_RAWIO=y # # I2O device support # # CONFIG_I2O is not set # # Network device support # CONFIG_NETDEVICES=y # CONFIG_DUMMY is not set CONFIG_BONDING=y # CONFIG_EQUALIZER is not set CONFIG_TUN=y # CONFIG_NET_SB1000 is not set # # ARCnet devices # # CONFIG_ARCNET is not set # # PHY device support # # CONFIG_PHYLIB is not set # # Ethernet (10 or 100Mbit) # CONFIG_NET_ETHERNET=y CONFIG_MII=y # CONFIG_HAPPYMEAL is not set # CONFIG_SUNGEM is not set # CONFIG_CASSINI is not set # CONFIG_NET_VENDOR_3COM is not set # CONFIG_LANCE is not set # CONFIG_NET_VENDOR_SMC is not set # CONFIG_NET_VENDOR_RACAL is not set # # Tulip family network device support # # CONFIG_NET_TULIP is not set # CONFIG_AT1700 is not set # CONFIG_DEPCA is not set # CONFIG_HP100 is not set # CONFIG_NET_ISA is not set CONFIG_NET_PCI=y CONFIG_PCNET32=y # CONFIG_PCNET32_NAPI is not set CONFIG_AMD8111_ETH=y CONFIG_AMD8111E_NAPI=y # CONFIG_ADAPTEC_STARFIRE is not set # CONFIG_AC3200 is not set # CONFIG_APRICOT is not set # CONFIG_B44 is not set # CONFIG_FORCEDETH is not set # CONFIG_CS89x0 is not set # CONFIG_DGRS is not set # CONFIG_EEPRO100 is not set CONFIG_E100=y # CONFIG_FEALNX is not set # CONFIG_NATSEMI is not set # CONFIG_NE2K_PCI is not set # CONFIG_8139CP is not set CONFIG_8139TOO=y CONFIG_8139TOO_PIO=y # CONFIG_8139TOO_TUNE_TWISTER is not set # CONFIG_8139TOO_8129 is not set # CONFIG_8139_OLD_RX_RESET is not set # CONFIG_SIS900 is not set # CONFIG_EPIC100 is not set # CONFIG_SUNDANCE is not set # CONFIG_TLAN is not set # CONFIG_VIA_RHINE is not set # CONFIG_NET_POCKET is not set # # Ethernet (1000 Mbit) # # CONFIG_ACENIC is not set # CONFIG_DL2K is not set # CONFIG_E1000 is not set # CONFIG_NS83820 is not set # CONFIG_HAMACHI is not set # CONFIG_YELLOWFIN is not set # CONFIG_R8169 is not set # CONFIG_SIS190 is not set # CONFIG_SKGE is not set # CONFIG_SKY2 is not set # CONFIG_SK98LIN is not set # 
CONFIG_VIA_VELOCITY is not set CONFIG_TIGON3=y # CONFIG_BNX2 is not set # CONFIG_QLA3XXX is not set # # Ethernet (10000 Mbit) # # CONFIG_CHELSIO_T1 is not set # CONFIG_IXGB is not set CONFIG_S2IO=m # CONFIG_S2IO_NAPI is not set # CONFIG_MYRI10GE is not set # CONFIG_NETXEN_NIC is not set # # Token Ring devices # # CONFIG_TR is not set # # Wireless LAN (non-hamradio) # CONFIG_NET_RADIO=y CONFIG_NET_WIRELESS_RTNETLINK=y # # Obsolete Wireless cards support (pre-802.11) # # CONFIG_STRIP is not set # CONFIG_ARLAN is not set # CONFIG_WAVELAN is not set # CONFIG_PCMCIA_WAVELAN is not set # CONFIG_PCMCIA_NETWAVE is not set # # Wireless 802.11 Frequency Hopping cards support # # CONFIG_PCMCIA_RAYCS is not set # # Wireless 802.11b ISA/PCI cards support # # CONFIG_IPW2100 is not set CONFIG_IPW2200=m CONFIG_IPW2200_MONITOR=y CONFIG_IPW2200_RADIOTAP=y CONFIG_IPW2200_PROMISCUOUS=y CONFIG_IPW2200_QOS=y # CONFIG_IPW2200_DEBUG is not set # CONFIG_AIRO is not set # CONFIG_HERMES is not set # CONFIG_ATMEL is not set # # Wireless 802.11b Pcmcia/Cardbus cards support # # CONFIG_AIRO_CS is not set # CONFIG_PCMCIA_WL3501 is not set # # Prism GT/Duette 802.11(a/b/g) PCI/Cardbus support # # CONFIG_PRISM54 is not set # CONFIG_USB_ZD1201 is not set CONFIG_HOSTAP=m CONFIG_HOSTAP_FIRMWARE=y CONFIG_HOSTAP_FIRMWARE_NVRAM=y # CONFIG_HOSTAP_PLX is not set # CONFIG_HOSTAP_PCI is not set CONFIG_HOSTAP_CS=m # CONFIG_BCM43XX is not set CONFIG_ZD1211RW=m # CONFIG_ZD1211RW_DEBUG is not set CONFIG_NET_WIRELESS=y # # PCMCIA network device support # # CONFIG_NET_PCMCIA is not set # # Wan interfaces # # CONFIG_WAN is not set # CONFIG_FDDI is not set # CONFIG_HIPPI is not set # CONFIG_PLIP is not set CONFIG_PPP=y CONFIG_PPP_MULTILINK=y CONFIG_PPP_FILTER=y CONFIG_PPP_ASYNC=y CONFIG_PPP_SYNC_TTY=y CONFIG_PPP_DEFLATE=y CONFIG_PPP_BSDCOMP=y # CONFIG_PPP_MPPE is not set CONFIG_PPPOE=y # CONFIG_SLIP is not set CONFIG_SLHC=y # CONFIG_NET_FC is not set # CONFIG_SHAPER is not set CONFIG_NETCONSOLE=y CONFIG_NETPOLL=y # 
CONFIG_NETPOLL_RX is not set # CONFIG_NETPOLL_TRAP is not set CONFIG_NET_POLL_CONTROLLER=y # # ISDN subsystem # # CONFIG_ISDN is not set # # Telephony Support # # CONFIG_PHONE is not set # # Input device support # CONFIG_INPUT=y # CONFIG_INPUT_FF_MEMLESS is not set # # Userland interfaces # CONFIG_INPUT_MOUSEDEV=y CONFIG_INPUT_MOUSEDEV_PSAUX=y CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024 CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768 # CONFIG_INPUT_JOYDEV is not set # CONFIG_INPUT_TSDEV is not set # CONFIG_INPUT_EVDEV is not set # CONFIG_INPUT_EVBUG is not set # # Input Device Drivers # CONFIG_INPUT_KEYBOARD=y CONFIG_KEYBOARD_ATKBD=y # CONFIG_KEYBOARD_SUNKBD is not set # CONFIG_KEYBOARD_LKKBD is not set # CONFIG_KEYBOARD_XTKBD is not set # CONFIG_KEYBOARD_NEWTON is not set # CONFIG_KEYBOARD_STOWAWAY is not set CONFIG_INPUT_MOUSE=y CONFIG_MOUSE_PS2=y # CONFIG_MOUSE_SERIAL is not set # CONFIG_MOUSE_INPORT is not set # CONFIG_MOUSE_LOGIBM is not set # CONFIG_MOUSE_PC110PAD is not set # CONFIG_MOUSE_VSXXXAA is not set # CONFIG_INPUT_JOYSTICK is not set # CONFIG_INPUT_TOUCHSCREEN is not set CONFIG_INPUT_MISC=y # CONFIG_INPUT_PCSPKR is not set # CONFIG_INPUT_WISTRON_BTNS is not set CONFIG_INPUT_UINPUT=m # # Hardware I/O ports # CONFIG_SERIO=y CONFIG_SERIO_I8042=y # CONFIG_SERIO_SERPORT is not set # CONFIG_SERIO_CT82C710 is not set # CONFIG_SERIO_PARKBD is not set # CONFIG_SERIO_PCIPS2 is not set CONFIG_SERIO_LIBPS2=y # CONFIG_SERIO_RAW is not set # CONFIG_GAMEPORT is not set # # Character devices # CONFIG_VT=y CONFIG_VT_CONSOLE=y CONFIG_HW_CONSOLE=y # CONFIG_VT_HW_CONSOLE_BINDING is not set # CONFIG_SERIAL_NONSTANDARD is not set # # Serial drivers # CONFIG_SERIAL_8250=y # CONFIG_SERIAL_8250_CONSOLE is not set CONFIG_SERIAL_8250_PCI=y CONFIG_SERIAL_8250_PNP=y # CONFIG_SERIAL_8250_CS is not set CONFIG_SERIAL_8250_NR_UARTS=4 CONFIG_SERIAL_8250_RUNTIME_UARTS=4 # CONFIG_SERIAL_8250_EXTENDED is not set # # Non-8250 serial port support # CONFIG_SERIAL_CORE=y # CONFIG_SERIAL_JSM is not set 
CONFIG_UNIX98_PTYS=y # CONFIG_LEGACY_PTYS is not set CONFIG_PRINTER=m # CONFIG_LP_CONSOLE is not set CONFIG_PPDEV=m # CONFIG_TIPAR is not set # # IPMI # # CONFIG_IPMI_HANDLER is not set # # Watchdog Cards # # CONFIG_WATCHDOG is not set CONFIG_HW_RANDOM=y CONFIG_HW_RANDOM_INTEL=y CONFIG_HW_RANDOM_AMD=y CONFIG_HW_RANDOM_GEODE=y CONFIG_HW_RANDOM_VIA=y # CONFIG_NVRAM is not set CONFIG_RTC=y # CONFIG_DTLK is not set # CONFIG_R3964 is not set # CONFIG_APPLICOM is not set # CONFIG_SONYPI is not set CONFIG_AGP=y # CONFIG_AGP_ALI is not set # CONFIG_AGP_ATI is not set # CONFIG_AGP_AMD is not set # CONFIG_AGP_AMD64 is not set CONFIG_AGP_INTEL=y # CONFIG_AGP_NVIDIA is not set # CONFIG_AGP_SIS is not set # CONFIG_AGP_SWORKS is not set # CONFIG_AGP_VIA is not set # CONFIG_AGP_EFFICEON is not set CONFIG_DRM=y # CONFIG_DRM_TDFX is not set # CONFIG_DRM_R128 is not set CONFIG_DRM_RADEON=m # CONFIG_DRM_I810 is not set # CONFIG_DRM_I830 is not set # CONFIG_DRM_I915 is not set # CONFIG_DRM_MGA is not set # CONFIG_DRM_SIS is not set # CONFIG_DRM_VIA is not set # CONFIG_DRM_SAVAGE is not set # # PCMCIA character devices # # CONFIG_SYNCLINK_CS is not set # CONFIG_CARDMAN_4000 is not set # CONFIG_CARDMAN_4040 is not set # CONFIG_MWAVE is not set # CONFIG_PC8736x_GPIO is not set # CONFIG_NSC_GPIO is not set # CONFIG_CS5535_GPIO is not set # CONFIG_RAW_DRIVER is not set CONFIG_HPET=y # CONFIG_HPET_RTC_IRQ is not set CONFIG_HPET_MMAP=y # CONFIG_HANGCHECK_TIMER is not set # # TPM devices # CONFIG_TCG_TPM=y CONFIG_TCG_TIS=y CONFIG_TCG_NSC=y CONFIG_TCG_ATMEL=y CONFIG_TCG_INFINEON=y # CONFIG_TELCLOCK is not set # # I2C support # CONFIG_I2C=y CONFIG_I2C_CHARDEV=y # # I2C Algorithms # CONFIG_I2C_ALGOBIT=y # CONFIG_I2C_ALGOPCF is not set # CONFIG_I2C_ALGOPCA is not set # # I2C Hardware Bus support # # CONFIG_I2C_ALI1535 is not set # CONFIG_I2C_ALI1563 is not set # CONFIG_I2C_ALI15X3 is not set # CONFIG_I2C_AMD756 is not set # CONFIG_I2C_AMD8111 is not set # CONFIG_I2C_ELEKTOR is not set 
CONFIG_I2C_I801=y CONFIG_I2C_I810=y # CONFIG_I2C_PIIX4 is not set # CONFIG_I2C_NFORCE2 is not set # CONFIG_I2C_OCORES is not set # CONFIG_I2C_PARPORT is not set # CONFIG_I2C_PARPORT_LIGHT is not set # CONFIG_I2C_PROSAVAGE is not set # CONFIG_I2C_SAVAGE4 is not set # CONFIG_SCx200_ACB is not set # CONFIG_I2C_SIS5595 is not set # CONFIG_I2C_SIS630 is not set # CONFIG_I2C_SIS96X is not set # CONFIG_I2C_STUB is not set # CONFIG_I2C_VIA is not set # CONFIG_I2C_VIAPRO is not set # CONFIG_I2C_VOODOO3 is not set # CONFIG_I2C_PCA_ISA is not set # # Miscellaneous I2C Chip support # # CONFIG_SENSORS_DS1337 is not set # CONFIG_SENSORS_DS1374 is not set CONFIG_SENSORS_EEPROM=m # CONFIG_SENSORS_PCF8574 is not set # CONFIG_SENSORS_PCA9539 is not set # CONFIG_SENSORS_PCF8591 is not set # CONFIG_SENSORS_MAX6875 is not set # CONFIG_I2C_DEBUG_CORE is not set # CONFIG_I2C_DEBUG_ALGO is not set # CONFIG_I2C_DEBUG_BUS is not set # CONFIG_I2C_DEBUG_CHIP is not set # # SPI support # # CONFIG_SPI is not set # CONFIG_SPI_MASTER is not set # # Dallas's 1-wire bus # # CONFIG_W1 is not set # # Hardware Monitoring support # CONFIG_HWMON=y # CONFIG_HWMON_VID is not set # CONFIG_SENSORS_ABITUGURU is not set # CONFIG_SENSORS_ADM1021 is not set # CONFIG_SENSORS_ADM1025 is not set # CONFIG_SENSORS_ADM1026 is not set # CONFIG_SENSORS_ADM1031 is not set # CONFIG_SENSORS_ADM9240 is not set # CONFIG_SENSORS_K8TEMP is not set # CONFIG_SENSORS_ASB100 is not set # CONFIG_SENSORS_ATXP1 is not set # CONFIG_SENSORS_DS1621 is not set # CONFIG_SENSORS_F71805F is not set # CONFIG_SENSORS_FSCHER is not set # CONFIG_SENSORS_FSCPOS is not set # CONFIG_SENSORS_GL518SM is not set # CONFIG_SENSORS_GL520SM is not set # CONFIG_SENSORS_IT87 is not set # CONFIG_SENSORS_LM63 is not set # CONFIG_SENSORS_LM75 is not set # CONFIG_SENSORS_LM77 is not set # CONFIG_SENSORS_LM78 is not set # CONFIG_SENSORS_LM80 is not set # CONFIG_SENSORS_LM83 is not set # CONFIG_SENSORS_LM85 is not set # CONFIG_SENSORS_LM87 is not set # 
CONFIG_SENSORS_LM90 is not set # CONFIG_SENSORS_LM92 is not set # CONFIG_SENSORS_MAX1619 is not set # CONFIG_SENSORS_PC87360 is not set # CONFIG_SENSORS_PC87427 is not set # CONFIG_SENSORS_SIS5595 is not set # CONFIG_SENSORS_SMSC47M1 is not set # CONFIG_SENSORS_SMSC47M192 is not set # CONFIG_SENSORS_SMSC47B397 is not set # CONFIG_SENSORS_VIA686A is not set # CONFIG_SENSORS_VT1211 is not set # CONFIG_SENSORS_VT8231 is not set # CONFIG_SENSORS_W83781D is not set # CONFIG_SENSORS_W83791D is not set # CONFIG_SENSORS_W83792D is not set # CONFIG_SENSORS_W83793 is not set # CONFIG_SENSORS_W83L785TS is not set # CONFIG_SENSORS_W83627HF is not set # CONFIG_SENSORS_W83627EHF is not set CONFIG_SENSORS_HDAPS=m # CONFIG_HWMON_DEBUG_CHIP is not set # # Multimedia devices # # CONFIG_VIDEO_DEV is not set # # Digital Video Broadcasting Devices # # CONFIG_DVB is not set # CONFIG_USB_DABUSB is not set # # Graphics support # CONFIG_FIRMWARE_EDID=y CONFIG_FB=m CONFIG_FB_DDC=m CONFIG_FB_CFB_FILLRECT=m CONFIG_FB_CFB_COPYAREA=m CONFIG_FB_CFB_IMAGEBLIT=m # CONFIG_FB_MACMODES is not set # CONFIG_FB_BACKLIGHT is not set CONFIG_FB_MODE_HELPERS=y # CONFIG_FB_TILEBLITTING is not set # CONFIG_FB_CIRRUS is not set # CONFIG_FB_PM2 is not set # CONFIG_FB_CYBER2000 is not set # CONFIG_FB_ARC is not set # CONFIG_FB_VGA16 is not set # CONFIG_FB_HGA is not set # CONFIG_FB_S1D13XXX is not set # CONFIG_FB_NVIDIA is not set # CONFIG_FB_RIVA is not set # CONFIG_FB_I810 is not set # CONFIG_FB_INTEL is not set # CONFIG_FB_MATROX is not set CONFIG_FB_RADEON=m CONFIG_FB_RADEON_I2C=y CONFIG_FB_RADEON_DEBUG=y # CONFIG_FB_ATY128 is not set # CONFIG_FB_ATY is not set # CONFIG_FB_SAVAGE is not set # CONFIG_FB_SIS is not set # CONFIG_FB_NEOMAGIC is not set # CONFIG_FB_KYRO is not set # CONFIG_FB_3DFX is not set # CONFIG_FB_VOODOO1 is not set # CONFIG_FB_CYBLA is not set # CONFIG_FB_TRIDENT is not set # CONFIG_FB_GEODE is not set # CONFIG_FB_VIRTUAL is not set # # Console display driver support # CONFIG_VGA_CONSOLE=y 
CONFIG_VGACON_SOFT_SCROLLBACK=y CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64 CONFIG_VIDEO_SELECT=y # CONFIG_MDA_CONSOLE is not set CONFIG_DUMMY_CONSOLE=y CONFIG_FRAMEBUFFER_CONSOLE=m # CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set CONFIG_FONTS=y # CONFIG_FONT_8x8 is not set # CONFIG_FONT_8x16 is not set # CONFIG_FONT_6x11 is not set # CONFIG_FONT_7x14 is not set # CONFIG_FONT_PEARL_8x8 is not set # CONFIG_FONT_ACORN_8x8 is not set # CONFIG_FONT_MINI_4x6 is not set # CONFIG_FONT_SUN8x16 is not set CONFIG_FONT_SUN12x22=y # CONFIG_FONT_10x18 is not set # # Logo configuration # # CONFIG_LOGO is not set CONFIG_BACKLIGHT_LCD_SUPPORT=y CONFIG_BACKLIGHT_CLASS_DEVICE=m CONFIG_BACKLIGHT_DEVICE=y CONFIG_LCD_CLASS_DEVICE=m CONFIG_LCD_DEVICE=y # # Sound # CONFIG_SOUND=y # # Advanced Linux Sound Architecture # CONFIG_SND=y CONFIG_SND_TIMER=y CONFIG_SND_PCM=y # CONFIG_SND_SEQUENCER is not set CONFIG_SND_OSSEMUL=y CONFIG_SND_MIXER_OSS=y CONFIG_SND_PCM_OSS=y # CONFIG_SND_PCM_OSS_PLUGINS is not set CONFIG_SND_RTCTIMER=y # CONFIG_SND_DYNAMIC_MINORS is not set # CONFIG_SND_SUPPORT_OLD_API is not set CONFIG_SND_VERBOSE_PROCFS=y # CONFIG_SND_VERBOSE_PRINTK is not set # CONFIG_SND_DEBUG is not set # # Generic devices # CONFIG_SND_AC97_CODEC=y # CONFIG_SND_DUMMY is not set # CONFIG_SND_MTPAV is not set # CONFIG_SND_MTS64 is not set # CONFIG_SND_SERIAL_U16550 is not set # CONFIG_SND_MPU401 is not set # # ISA devices # # CONFIG_SND_ADLIB is not set # CONFIG_SND_AD1816A is not set # CONFIG_SND_AD1848 is not set # CONFIG_SND_ALS100 is not set # CONFIG_SND_AZT2320 is not set # CONFIG_SND_CMI8330 is not set # CONFIG_SND_CS4231 is not set # CONFIG_SND_CS4232 is not set # CONFIG_SND_CS4236 is not set # CONFIG_SND_DT019X is not set # CONFIG_SND_ES968 is not set # CONFIG_SND_ES1688 is not set # CONFIG_SND_ES18XX is not set # CONFIG_SND_GUSCLASSIC is not set # CONFIG_SND_GUSEXTREME is not set # CONFIG_SND_GUSMAX is not set # CONFIG_SND_INTERWAVE is not set # CONFIG_SND_INTERWAVE_STB is not set # 
CONFIG_SND_OPL3SA2 is not set # CONFIG_SND_OPTI92X_AD1848 is not set # CONFIG_SND_OPTI92X_CS4231 is not set # CONFIG_SND_OPTI93X is not set # CONFIG_SND_MIRO is not set # CONFIG_SND_SB8 is not set # CONFIG_SND_SB16 is not set # CONFIG_SND_SBAWE is not set # CONFIG_SND_SGALAXY is not set # CONFIG_SND_SSCAPE is not set # CONFIG_SND_WAVEFRONT is not set # # PCI devices # # CONFIG_SND_AD1889 is not set # CONFIG_SND_ALS300 is not set # CONFIG_SND_ALS4000 is not set # CONFIG_SND_ALI5451 is not set # CONFIG_SND_ATIIXP is not set # CONFIG_SND_ATIIXP_MODEM is not set # CONFIG_SND_AU8810 is not set # CONFIG_SND_AU8820 is not set # CONFIG_SND_AU8830 is not set # CONFIG_SND_AZT3328 is not set # CONFIG_SND_BT87X is not set # CONFIG_SND_CA0106 is not set # CONFIG_SND_CMIPCI is not set # CONFIG_SND_CS4281 is not set # CONFIG_SND_CS46XX is not set # CONFIG_SND_CS5535AUDIO is not set # CONFIG_SND_DARLA20 is not set # CONFIG_SND_GINA20 is not set # CONFIG_SND_LAYLA20 is not set # CONFIG_SND_DARLA24 is not set # CONFIG_SND_GINA24 is not set # CONFIG_SND_LAYLA24 is not set # CONFIG_SND_MONA is not set # CONFIG_SND_MIA is not set # CONFIG_SND_ECHO3G is not set # CONFIG_SND_INDIGO is not set # CONFIG_SND_INDIGOIO is not set # CONFIG_SND_INDIGODJ is not set # CONFIG_SND_EMU10K1 is not set # CONFIG_SND_EMU10K1X is not set # CONFIG_SND_ENS1370 is not set # CONFIG_SND_ENS1371 is not set # CONFIG_SND_ES1938 is not set # CONFIG_SND_ES1968 is not set # CONFIG_SND_FM801 is not set CONFIG_SND_HDA_INTEL=y # CONFIG_SND_HDSP is not set # CONFIG_SND_HDSPM is not set # CONFIG_SND_ICE1712 is not set # CONFIG_SND_ICE1724 is not set CONFIG_SND_INTEL8X0=y # CONFIG_SND_INTEL8X0M is not set # CONFIG_SND_KORG1212 is not set # CONFIG_SND_MAESTRO3 is not set # CONFIG_SND_MIXART is not set # CONFIG_SND_NM256 is not set # CONFIG_SND_PCXHR is not set # CONFIG_SND_RIPTIDE is not set # CONFIG_SND_RME32 is not set # CONFIG_SND_RME96 is not set # CONFIG_SND_RME9652 is not set # CONFIG_SND_SONICVIBES is not set # 
CONFIG_SND_TRIDENT is not set # CONFIG_SND_VIA82XX is not set # CONFIG_SND_VIA82XX_MODEM is not set # CONFIG_SND_VX222 is not set # CONFIG_SND_YMFPCI is not set CONFIG_SND_AC97_POWER_SAVE=y # # USB devices # # CONFIG_SND_USB_AUDIO is not set # CONFIG_SND_USB_USX2Y is not set # # PCMCIA devices # # CONFIG_SND_VXPOCKET is not set # CONFIG_SND_PDAUDIOCF is not set # # Open Sound System # # CONFIG_SOUND_PRIME is not set CONFIG_AC97_BUS=y # # HID Devices # CONFIG_HID=y # # USB support # CONFIG_USB_ARCH_HAS_HCD=y CONFIG_USB_ARCH_HAS_OHCI=y CONFIG_USB_ARCH_HAS_EHCI=y CONFIG_USB=y # CONFIG_USB_DEBUG is not set # # Miscellaneous USB options # CONFIG_USB_DEVICEFS=y # CONFIG_USB_BANDWIDTH is not set # CONFIG_USB_DYNAMIC_MINORS is not set # CONFIG_USB_SUSPEND is not set CONFIG_USB_MULTITHREAD_PROBE=y # CONFIG_USB_OTG is not set # # USB Host Controller Drivers # CONFIG_USB_EHCI_HCD=m # CONFIG_USB_EHCI_SPLIT_ISO is not set # CONFIG_USB_EHCI_ROOT_HUB_TT is not set # CONFIG_USB_EHCI_TT_NEWSCHED is not set # CONFIG_USB_ISP116X_HCD is not set # CONFIG_USB_OHCI_HCD is not set CONFIG_USB_UHCI_HCD=m # CONFIG_USB_SL811_HCD is not set # # USB Device Class drivers # # CONFIG_USB_ACM is not set CONFIG_USB_PRINTER=y # # NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support' # # # may also be needed; see USB_STORAGE Help for more information # CONFIG_USB_STORAGE=m # CONFIG_USB_STORAGE_DEBUG is not set # CONFIG_USB_STORAGE_DATAFAB is not set # CONFIG_USB_STORAGE_FREECOM is not set # CONFIG_USB_STORAGE_DPCM is not set # CONFIG_USB_STORAGE_USBAT is not set # CONFIG_USB_STORAGE_SDDR09 is not set # CONFIG_USB_STORAGE_SDDR55 is not set # CONFIG_USB_STORAGE_JUMPSHOT is not set # CONFIG_USB_STORAGE_ALAUDA is not set # CONFIG_USB_STORAGE_KARMA is not set CONFIG_USB_LIBUSUAL=y # # USB Input Devices # CONFIG_USB_HID=y # CONFIG_USB_HID_POWERBOOK is not set # CONFIG_HID_FF is not set # CONFIG_USB_HIDDEV is not set # CONFIG_USB_AIPTEK is not set # CONFIG_USB_WACOM is not set # CONFIG_USB_ACECAD is not 
set # CONFIG_USB_KBTAB is not set # CONFIG_USB_POWERMATE is not set # CONFIG_USB_TOUCHSCREEN is not set # CONFIG_USB_YEALINK is not set # CONFIG_USB_XPAD is not set # CONFIG_USB_ATI_REMOTE is not set # CONFIG_USB_ATI_REMOTE2 is not set # CONFIG_USB_KEYSPAN_REMOTE is not set # CONFIG_USB_APPLETOUCH is not set # # USB Imaging devices # # CONFIG_USB_MDC800 is not set # CONFIG_USB_MICROTEK is not set # # USB Network Adapters # # CONFIG_USB_CATC is not set # CONFIG_USB_KAWETH is not set # CONFIG_USB_PEGASUS is not set # CONFIG_USB_RTL8150 is not set CONFIG_USB_USBNET_MII=y CONFIG_USB_USBNET=y CONFIG_USB_NET_AX8817X=y CONFIG_USB_NET_CDCETHER=m # CONFIG_USB_NET_GL620A is not set CONFIG_USB_NET_NET1080=m # CONFIG_USB_NET_PLUSB is not set # CONFIG_USB_NET_MCS7830 is not set CONFIG_USB_NET_RNDIS_HOST=m CONFIG_USB_NET_CDC_SUBSET=m # CONFIG_USB_ALI_M5632 is not set # CONFIG_USB_AN2720 is not set CONFIG_USB_BELKIN=y CONFIG_USB_ARMLINUX=y # CONFIG_USB_EPSON2888 is not set CONFIG_USB_NET_ZAURUS=m CONFIG_USB_MON=y # # USB port drivers # # CONFIG_USB_USS720 is not set # # USB Serial Converter support # CONFIG_USB_SERIAL=y # CONFIG_USB_SERIAL_CONSOLE is not set CONFIG_USB_SERIAL_GENERIC=y # CONFIG_USB_SERIAL_AIRCABLE is not set # CONFIG_USB_SERIAL_AIRPRIME is not set # CONFIG_USB_SERIAL_ARK3116 is not set # CONFIG_USB_SERIAL_BELKIN is not set # CONFIG_USB_SERIAL_WHITEHEAT is not set # CONFIG_USB_SERIAL_DIGI_ACCELEPORT is not set # CONFIG_USB_SERIAL_CP2101 is not set # CONFIG_USB_SERIAL_CYPRESS_M8 is not set # CONFIG_USB_SERIAL_EMPEG is not set # CONFIG_USB_SERIAL_FTDI_SIO is not set # CONFIG_USB_SERIAL_FUNSOFT is not set # CONFIG_USB_SERIAL_VISOR is not set # CONFIG_USB_SERIAL_IPAQ is not set # CONFIG_USB_SERIAL_IR is not set # CONFIG_USB_SERIAL_EDGEPORT is not set # CONFIG_USB_SERIAL_EDGEPORT_TI is not set # CONFIG_USB_SERIAL_GARMIN is not set # CONFIG_USB_SERIAL_IPW is not set # CONFIG_USB_SERIAL_KEYSPAN_PDA is not set # CONFIG_USB_SERIAL_KEYSPAN is not set # 
CONFIG_USB_SERIAL_KLSI is not set # CONFIG_USB_SERIAL_KOBIL_SCT is not set # CONFIG_USB_SERIAL_MCT_U232 is not set # CONFIG_USB_SERIAL_MOS7720 is not set # CONFIG_USB_SERIAL_MOS7840 is not set # CONFIG_USB_SERIAL_NAVMAN is not set CONFIG_USB_SERIAL_PL2303=y CONFIG_USB_SERIAL_HP4X=y # CONFIG_USB_SERIAL_SAFE is not set # CONFIG_USB_SERIAL_SIERRAWIRELESS is not set # CONFIG_USB_SERIAL_TI is not set # CONFIG_USB_SERIAL_CYBERJACK is not set # CONFIG_USB_SERIAL_XIRCOM is not set # CONFIG_USB_SERIAL_OPTION is not set # CONFIG_USB_SERIAL_OMNINET is not set # CONFIG_USB_SERIAL_DEBUG is not set # # USB Miscellaneous drivers # # CONFIG_USB_EMI62 is not set # CONFIG_USB_EMI26 is not set # CONFIG_USB_ADUTUX is not set # CONFIG_USB_AUERSWALD is not set # CONFIG_USB_RIO500 is not set # CONFIG_USB_LEGOTOWER is not set # CONFIG_USB_LCD is not set # CONFIG_USB_LED is not set # CONFIG_USB_CYPRESS_CY7C63 is not set # CONFIG_USB_CYTHERM is not set # CONFIG_USB_PHIDGET is not set # CONFIG_USB_IDMOUSE is not set # CONFIG_USB_FTDI_ELAN is not set # CONFIG_USB_APPLEDISPLAY is not set # CONFIG_USB_SISUSBVGA is not set # CONFIG_USB_LD is not set # CONFIG_USB_TRANCEVIBRATOR is not set # CONFIG_USB_TEST is not set # # USB DSL modem support # # # USB Gadget Support # # CONFIG_USB_GADGET is not set # # MMC/SD Card support # # CONFIG_MMC is not set # # LED devices # # CONFIG_NEW_LEDS is not set # # LED drivers # # # LED Triggers # # # InfiniBand support # # CONFIG_INFINIBAND is not set # # EDAC - error detection and reporting (RAS) (EXPERIMENTAL) # CONFIG_EDAC=y # # Reporting subsystems # # CONFIG_EDAC_DEBUG is not set CONFIG_EDAC_MM_EDAC=y # CONFIG_EDAC_AMD76X is not set # CONFIG_EDAC_E7XXX is not set # CONFIG_EDAC_E752X is not set # CONFIG_EDAC_I82875P is not set # CONFIG_EDAC_I82860 is not set # CONFIG_EDAC_R82600 is not set CONFIG_EDAC_POLL=y # # Real Time Clock # # CONFIG_RTC_CLASS is not set # # DMA Engine support # # CONFIG_DMA_ENGINE is not set # # DMA Clients # # # DMA Devices # # # 
Virtualization # # CONFIG_KVM is not set # # File systems # CONFIG_EXT2_FS=y CONFIG_EXT2_FS_XATTR=y CONFIG_EXT2_FS_POSIX_ACL=y CONFIG_EXT2_FS_SECURITY=y # CONFIG_EXT2_FS_XIP is not set CONFIG_EXT3_FS=y CONFIG_EXT3_FS_XATTR=y CONFIG_EXT3_FS_POSIX_ACL=y CONFIG_EXT3_FS_SECURITY=y # CONFIG_EXT4DEV_FS is not set CONFIG_JBD=y # CONFIG_JBD_DEBUG is not set CONFIG_FS_MBCACHE=y CONFIG_REISERFS_FS=y # CONFIG_REISERFS_CHECK is not set # CONFIG_REISERFS_PROC_INFO is not set # CONFIG_REISERFS_FS_XATTR is not set # CONFIG_JFS_FS is not set CONFIG_FS_POSIX_ACL=y # CONFIG_XFS_FS is not set # CONFIG_GFS2_FS is not set # CONFIG_OCFS2_FS is not set CONFIG_MINIX_FS=y # CONFIG_ROMFS_FS is not set CONFIG_INOTIFY=y CONFIG_INOTIFY_USER=y # CONFIG_QUOTA is not set CONFIG_DNOTIFY=y # CONFIG_AUTOFS_FS is not set CONFIG_AUTOFS4_FS=y # CONFIG_FUSE_FS is not set # # CD-ROM/DVD Filesystems # CONFIG_ISO9660_FS=y CONFIG_JOLIET=y # CONFIG_ZISOFS is not set CONFIG_UDF_FS=y CONFIG_UDF_NLS=y # # DOS/FAT/NT Filesystems # CONFIG_FAT_FS=y CONFIG_MSDOS_FS=y CONFIG_VFAT_FS=y CONFIG_FAT_DEFAULT_CODEPAGE=437 CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1" CONFIG_NTFS_FS=y # CONFIG_NTFS_DEBUG is not set CONFIG_NTFS_RW=y # # Pseudo filesystems # CONFIG_PROC_FS=y CONFIG_PROC_KCORE=y CONFIG_PROC_SYSCTL=y CONFIG_SYSFS=y CONFIG_TMPFS=y # CONFIG_TMPFS_POSIX_ACL is not set # CONFIG_HUGETLBFS is not set # CONFIG_HUGETLB_PAGE is not set CONFIG_RAMFS=y CONFIG_CONFIGFS_FS=y # # Miscellaneous filesystems # # CONFIG_ADFS_FS is not set # CONFIG_AFFS_FS is not set # CONFIG_HFS_FS is not set # CONFIG_HFSPLUS_FS is not set # CONFIG_BEFS_FS is not set # CONFIG_BFS_FS is not set # CONFIG_EFS_FS is not set CONFIG_JFFS2_FS=m CONFIG_JFFS2_FS_DEBUG=0 CONFIG_JFFS2_FS_WRITEBUFFER=y # CONFIG_JFFS2_SUMMARY is not set # CONFIG_JFFS2_FS_XATTR is not set CONFIG_JFFS2_COMPRESSION_OPTIONS=y CONFIG_JFFS2_ZLIB=y CONFIG_JFFS2_RTIME=y # CONFIG_JFFS2_RUBIN is not set # CONFIG_JFFS2_CMODE_NONE is not set CONFIG_JFFS2_CMODE_PRIORITY=y # 
CONFIG_JFFS2_CMODE_SIZE is not set CONFIG_CRAMFS=m # CONFIG_VXFS_FS is not set # CONFIG_HPFS_FS is not set # CONFIG_QNX4FS_FS is not set # CONFIG_SYSV_FS is not set # CONFIG_UFS_FS is not set # # Network File Systems # CONFIG_NFS_FS=y CONFIG_NFS_V3=y CONFIG_NFS_V3_ACL=y # CONFIG_NFS_V4 is not set CONFIG_NFS_DIRECTIO=y CONFIG_NFSD=y CONFIG_NFSD_V2_ACL=y CONFIG_NFSD_V3=y CONFIG_NFSD_V3_ACL=y # CONFIG_NFSD_V4 is not set CONFIG_NFSD_TCP=y CONFIG_LOCKD=y CONFIG_LOCKD_V4=y CONFIG_EXPORTFS=y CONFIG_NFS_ACL_SUPPORT=y CONFIG_NFS_COMMON=y CONFIG_SUNRPC=y # CONFIG_RPCSEC_GSS_KRB5 is not set # CONFIG_RPCSEC_GSS_SPKM3 is not set CONFIG_SMB_FS=y # CONFIG_SMB_NLS_DEFAULT is not set CONFIG_CIFS=y # CONFIG_CIFS_STATS is not set # CONFIG_CIFS_WEAK_PW_HASH is not set # CONFIG_CIFS_XATTR is not set # CONFIG_CIFS_DEBUG2 is not set # CONFIG_CIFS_EXPERIMENTAL is not set # CONFIG_NCP_FS is not set # CONFIG_CODA_FS is not set # CONFIG_AFS_FS is not set # CONFIG_9P_FS is not set # # Partition Types # # CONFIG_PARTITION_ADVANCED is not set CONFIG_MSDOS_PARTITION=y # # Native Language Support # CONFIG_NLS=y CONFIG_NLS_DEFAULT="iso8859-1" CONFIG_NLS_CODEPAGE_437=y # CONFIG_NLS_CODEPAGE_737 is not set # CONFIG_NLS_CODEPAGE_775 is not set CONFIG_NLS_CODEPAGE_850=y # CONFIG_NLS_CODEPAGE_852 is not set # CONFIG_NLS_CODEPAGE_855 is not set # CONFIG_NLS_CODEPAGE_857 is not set # CONFIG_NLS_CODEPAGE_860 is not set # CONFIG_NLS_CODEPAGE_861 is not set # CONFIG_NLS_CODEPAGE_862 is not set # CONFIG_NLS_CODEPAGE_863 is not set # CONFIG_NLS_CODEPAGE_864 is not set # CONFIG_NLS_CODEPAGE_865 is not set # CONFIG_NLS_CODEPAGE_866 is not set # CONFIG_NLS_CODEPAGE_869 is not set # CONFIG_NLS_CODEPAGE_936 is not set # CONFIG_NLS_CODEPAGE_950 is not set CONFIG_NLS_CODEPAGE_932=y # CONFIG_NLS_CODEPAGE_949 is not set # CONFIG_NLS_CODEPAGE_874 is not set # CONFIG_NLS_ISO8859_8 is not set # CONFIG_NLS_CODEPAGE_1250 is not set # CONFIG_NLS_CODEPAGE_1251 is not set # CONFIG_NLS_ASCII is not set CONFIG_NLS_ISO8859_1=y # 
CONFIG_NLS_ISO8859_2 is not set # CONFIG_NLS_ISO8859_3 is not set # CONFIG_NLS_ISO8859_4 is not set # CONFIG_NLS_ISO8859_5 is not set # CONFIG_NLS_ISO8859_6 is not set # CONFIG_NLS_ISO8859_7 is not set # CONFIG_NLS_ISO8859_9 is not set # CONFIG_NLS_ISO8859_13 is not set # CONFIG_NLS_ISO8859_14 is not set CONFIG_NLS_ISO8859_15=y # CONFIG_NLS_KOI8_R is not set # CONFIG_NLS_KOI8_U is not set CONFIG_NLS_UTF8=y # # Distributed Lock Manager # # CONFIG_DLM is not set # # Instrumentation Support # # CONFIG_PROFILING is not set # CONFIG_KPROBES is not set # # Kernel hacking # CONFIG_TRACE_IRQFLAGS_SUPPORT=y CONFIG_PRINTK_TIME=y CONFIG_ENABLE_MUST_CHECK=y CONFIG_MAGIC_SYSRQ=y # CONFIG_UNUSED_SYMBOLS is not set CONFIG_DEBUG_FS=y # CONFIG_HEADERS_CHECK is not set CONFIG_DEBUG_KERNEL=y CONFIG_LOG_BUF_SHIFT=15 CONFIG_DETECT_SOFTLOCKUP=y # CONFIG_SCHEDSTATS is not set # CONFIG_DEBUG_SLAB is not set # CONFIG_DEBUG_RT_MUTEXES is not set # CONFIG_RT_MUTEX_TESTER is not set # CONFIG_DEBUG_SPINLOCK is not set # CONFIG_DEBUG_MUTEXES is not set # CONFIG_DEBUG_RWSEMS is not set # CONFIG_DEBUG_LOCK_ALLOC is not set # CONFIG_PROVE_LOCKING is not set # CONFIG_DEBUG_SPINLOCK_SLEEP is not set # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set # CONFIG_DEBUG_KOBJECT is not set CONFIG_DEBUG_BUGVERBOSE=y # CONFIG_DEBUG_INFO is not set # CONFIG_DEBUG_VM is not set # CONFIG_DEBUG_LIST is not set # CONFIG_FRAME_POINTER is not set CONFIG_FORCED_INLINING=y # CONFIG_RCU_TORTURE_TEST is not set CONFIG_EARLY_PRINTK=y # CONFIG_DEBUG_STACKOVERFLOW is not set # CONFIG_DEBUG_STACK_USAGE is not set # # Page alloc debug is incompatible with Software Suspend on i386 # # CONFIG_DEBUG_RODATA is not set CONFIG_4KSTACKS=y CONFIG_X86_FIND_SMP_CONFIG=y CONFIG_X86_MPPARSE=y CONFIG_DOUBLEFAULT=y # # Security options # # CONFIG_KEYS is not set # CONFIG_SECURITY is not set # # Cryptographic options # CONFIG_CRYPTO=y CONFIG_CRYPTO_ALGAPI=y CONFIG_CRYPTO_BLKCIPHER=y CONFIG_CRYPTO_MANAGER=y # CONFIG_CRYPTO_HMAC is not set # 
CONFIG_CRYPTO_XCBC is not set # CONFIG_CRYPTO_NULL is not set # CONFIG_CRYPTO_MD4 is not set # CONFIG_CRYPTO_MD5 is not set # CONFIG_CRYPTO_SHA1 is not set # CONFIG_CRYPTO_SHA256 is not set # CONFIG_CRYPTO_SHA512 is not set # CONFIG_CRYPTO_WP512 is not set # CONFIG_CRYPTO_TGR192 is not set # CONFIG_CRYPTO_GF128MUL is not set CONFIG_CRYPTO_ECB=y CONFIG_CRYPTO_CBC=y # CONFIG_CRYPTO_LRW is not set # CONFIG_CRYPTO_DES is not set # CONFIG_CRYPTO_BLOWFISH is not set # CONFIG_CRYPTO_TWOFISH is not set # CONFIG_CRYPTO_TWOFISH_586 is not set # CONFIG_CRYPTO_SERPENT is not set CONFIG_CRYPTO_AES=y # CONFIG_CRYPTO_AES_586 is not set # CONFIG_CRYPTO_CAST5 is not set # CONFIG_CRYPTO_CAST6 is not set # CONFIG_CRYPTO_TEA is not set CONFIG_CRYPTO_ARC4=y # CONFIG_CRYPTO_KHAZAD is not set # CONFIG_CRYPTO_ANUBIS is not set # CONFIG_CRYPTO_DEFLATE is not set CONFIG_CRYPTO_MICHAEL_MIC=y # CONFIG_CRYPTO_CRC32C is not set # CONFIG_CRYPTO_TEST is not set # # Hardware crypto devices # # CONFIG_CRYPTO_DEV_PADLOCK is not set CONFIG_CRYPTO_DEV_GEODE=m # # Library routines # CONFIG_BITREVERSE=y CONFIG_CRC_CCITT=y # CONFIG_CRC16 is not set CONFIG_CRC32=y # CONFIG_LIBCRC32C is not set CONFIG_ZLIB_INFLATE=y CONFIG_ZLIB_DEFLATE=y CONFIG_PLIST=y CONFIG_IOMAP_COPY=y CONFIG_GENERIC_HARDIRQS=y CONFIG_GENERIC_IRQ_PROBE=y CONFIG_X86_BIOS_REBOOT=y CONFIG_KTIME_SCALAR=y dmesg: [ 0.000000] Linux version 2.6.20-rc2 (ranma@navi) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #26 Mon Dec 25 14:00:08 CET 2006 [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] sanitize start [ 0.000000] sanitize end [ 0.000000] copy_e820_map() start: 0000000000000000 size: 000000000009f000 end: 000000000009f000 type: 1 [ 0.000000] copy_e820_map() type is E820_RAM [ 0.000000] copy_e820_map() start: 000000000009f000 size: 0000000000001000 end: 00000000000a0000 type: 2 [ 0.000000] copy_e820_map() start: 00000000000dc000 size: 0000000000024000 end: 0000000000100000 type: 2 [ 0.000000] copy_e820_map() start: 
0000000000100000 size: 000000001fde0000 end: 000000001fee0000 type: 1 [ 0.000000] copy_e820_map() type is E820_RAM [ 0.000000] copy_e820_map() start: 000000001fee0000 size: 0000000000015000 end: 000000001fef5000 type: 3 [ 0.000000] copy_e820_map() start: 000000001fef5000 size: 000000000000b000 end: 000000001ff00000 type: 4 [ 0.000000] copy_e820_map() start: 000000001ff00000 size: 0000000000100000 end: 0000000020000000 type: 2 [ 0.000000] copy_e820_map() start: 00000000e0000000 size: 0000000010000000 end: 00000000f0000000 type: 2 [ 0.000000] copy_e820_map() start: 00000000f0008000 size: 0000000000004000 end: 00000000f000c000 type: 2 [ 0.000000] copy_e820_map() start: 00000000fec00000 size: 0000000000010000 end: 00000000fec10000 type: 2 [ 0.000000] copy_e820_map() start: 00000000fed14000 size: 0000000000006000 end: 00000000fed1a000 type: 2 [ 0.000000] copy_e820_map() start: 00000000fed20000 size: 0000000000070000 end: 00000000fed90000 type: 2 [ 0.000000] copy_e820_map() start: 00000000fee00000 size: 0000000000001000 end: 00000000fee01000 type: 2 [ 0.000000] copy_e820_map() start: 00000000ff000000 size: 0000000001000000 end: 0000000100000000 type: 2 [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009f000 (usable) [ 0.000000] BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved) [ 0.000000] BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved) [ 0.000000] BIOS-e820: 0000000000100000 - 000000001fee0000 (usable) [ 0.000000] BIOS-e820: 000000001fee0000 - 000000001fef5000 (ACPI data) [ 0.000000] BIOS-e820: 000000001fef5000 - 000000001ff00000 (ACPI NVS) [ 0.000000] BIOS-e820: 000000001ff00000 - 0000000020000000 (reserved) [ 0.000000] BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved) [ 0.000000] BIOS-e820: 00000000f0008000 - 00000000f000c000 (reserved) [ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved) [ 0.000000] BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved) [ 0.000000] BIOS-e820: 00000000fed20000 - 00000000fed90000 
(reserved) [ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved) [ 0.000000] BIOS-e820: 00000000ff000000 - 0000000100000000 (reserved) [ 0.000000] 510MB LOWMEM available. [ 0.000000] Entering add_active_range(0, 0, 130784) 0 entries of 256 used [ 0.000000] Zone PFN ranges: [ 0.000000] DMA 0 -> 4096 [ 0.000000] Normal 4096 -> 130784 [ 0.000000] early_node_map[1] active PFN ranges [ 0.000000] 0: 0 -> 130784 [ 0.000000] On node 0 totalpages: 130784 [ 0.000000] DMA zone: 32 pages used for memmap [ 0.000000] DMA zone: 0 pages reserved [ 0.000000] DMA zone: 4064 pages, LIFO batch:0 [ 0.000000] Normal zone: 989 pages used for memmap [ 0.000000] Normal zone: 125699 pages, LIFO batch:31 [ 0.000000] DMI present. [ 0.000000] ACPI: RSDP (v002 IBM ) @ 0x000f6bf0 [ 0.000000] ACPI: XSDT (v001 IBM TP-76 0x00001270 LTP 0x00000000) @ 0x1fee6f9b [ 0.000000] ACPI: FADT (v003 IBM TP-76 0x00001270 IBM 0x00000001) @ 0x1fee7000 [ 0.000000] ACPI: SSDT (v001 IBM TP-76 0x00001270 MSFT 0x0100000e) @ 0x1fee71b4 [ 0.000000] ACPI: ECDT (v001 IBM TP-76 0x00001270 IBM 0x00000001) @ 0x1fef4d46 [ 0.000000] ACPI: TCPA (v001 IBM TP-76 0x00001270 PTL 0x00000001) @ 0x1fef4d98 [ 0.000000] ACPI: MADT (v001 IBM TP-76 0x00001270 IBM 0x00000001) @ 0x1fef4dca [ 0.000000] ACPI: MCFG (v001 IBM TP-76 0x00001270 IBM 0x00000001) @ 0x1fef4e24 [ 0.000000] ACPI: BOOT (v001 IBM TP-76 0x00001270 LTP 0x00000001) @ 0x1fef4fd8 [ 0.000000] ACPI: DSDT (v001 IBM TP-76 0x00001270 MSFT 0x0100000e) @ 0x00000000 [ 0.000000] ACPI: PM-Timer IO Port: 0x1008 [ 0.000000] ACPI: Local APIC address 0xfee00000 [ 0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled) [ 0.000000] Processor #0 6:13 APIC version 20 [ 0.000000] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1]) [ 0.000000] ACPI: IOAPIC (id[0x01] address[0xfec00000] gsi_base[0]) [ 0.000000] IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23 [ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) [ 0.000000] ACPI: INT_SRC_OVR (bus 
0 bus_irq 9 global_irq 9 high level) [ 0.000000] ACPI: IRQ0 used by override. [ 0.000000] ACPI: IRQ2 used by override. [ 0.000000] ACPI: IRQ9 used by override. [ 0.000000] Enabling APIC mode: Flat. Using 1 I/O APICs [ 0.000000] Using ACPI (MADT) for SMP configuration information [ 0.000000] Allocating PCI resources starting at 30000000 (gap: 20000000:c0000000) [ 0.000000] Detected 1995.186 MHz processor. [ 2.815181] Built 1 zonelists. Total pages: 129763 [ 2.815183] Kernel command line: root=/dev/sda6 resume=/dev/sda5 vga=ext parport=auto ide0=noprobe ide1=noprobe libata.atapi_enabled=1 ro [ 2.815401] mapped APIC to ffff9000 (fee00000) [ 2.815404] mapped IOAPIC to ffff8000 (fec00000) [ 2.815406] Enabling fast FPU save and restore... done. [ 2.815408] Enabling unmasked SIMD FPU exception support... done. [ 2.815416] Initializing CPU#0 [ 2.815473] CPU 0 irqstacks, hard=c05f3000 soft=c05f2000 [ 2.815476] PID hash table entries: 2048 (order: 11, 8192 bytes) [ 2.815491] is_hpet_capable() [ 2.815493] trying to force-enable HPET [ 2.815498] RCBA already mapped at f0008000 [ 2.815501] HPTC: RCBA Base is 0xf0008000, mapped at 0xffffc000 to 0xfffff000 [ 2.815505] HPTC: RCBA 0x3404 is 0x00000080n<3>Intel HPET force-enabled at 0xfed00000 [ 2.817499] Console: colour VGA+ 80x50 [ 2.821573] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes) [ 2.821816] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes) [ 2.831460] Memory: 512836k/523136k available (3392k kernel code, 9880k reserved, 1444k data, 200k init, 0k highmem) [ 2.831572] virtual kernel memory layout: [ 2.831573] fixmap : 0xfffb3000 - 0xfffff000 ( 304 kB) [ 2.831574] vmalloc : 0xe0800000 - 0xfffb1000 ( 503 MB) [ 2.831575] lowmem : 0xc0000000 - 0xdfee0000 ( 510 MB) [ 2.831576] .init : 0xc05bb000 - 0xc05ed000 ( 200 kB) [ 2.831577] .data : 0xc0450065 - 0xc05b90b8 (1444 kB) [ 2.831579] .text : 0xc0100000 - 0xc0450065 (3392 kB) [ 2.832061] Checking if this processor honours the WP bit even in supervisor 
mode... Ok. [ 2.832297] hpet_enable [ 2.832382] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0 [ 2.832602] hpet0: 3 64-bit timers, 14318180 Hz [ 2.833675] Using HPET for base-timer [ 2.915669] Calibrating delay using timer specific routine.. 3994.20 BogoMIPS (lpj=6654729) [ 2.915836] Mount-cache hash table entries: 512 [ 2.915981] CPU: After generic identify, caps: afe9fbff 00100000 00000000 00000000 00000180 00000000 00000000 [ 2.915990] CPU: L1 I cache: 32K, L1 D cache: 32K [ 2.916101] CPU: L2 cache: 2048K [ 2.916171] CPU: After all inits, caps: afe9fbff 00100000 00000000 00002040 00000180 00000000 00000000 [ 2.916176] Intel machine check architecture supported. [ 2.916248] Intel machine check reporting enabled on CPU#0. [ 2.916320] Compat vDSO mapped to ffffa000. [ 2.916396] CPU: Intel(R) Pentium(R) M processor 2.00GHz stepping 08 [ 2.916543] Checking 'hlt' instruction... OK. [ 2.929131] ACPI: Core revision 20060707 [ 2.945497] ENABLING IO-APIC IRQs [ 2.945753] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1 [ 3.082402] NET: Registered protocol family 16 [ 3.082649] ACPI: ACPI Dock Station Driver [ 3.082749] ACPI: bus type pci registered [ 3.082824] PCI: Using MMCONFIG [ 3.083561] Setting up standard PCI resources [ 3.093993] ACPI: Interpreter enabled [ 3.094065] ACPI: Using IOAPIC for interrupt routing [ 3.094750] ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 *11) [ 3.095684] ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 9 10 *11) [ 3.096606] ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 10 *11) [ 3.097521] ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 9 10 *11) [ 3.098438] ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 9 10 *11) [ 3.099369] ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 9 10 *11) [ 3.100284] ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 *5 6 7 9 10 11) [ 3.101201] ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 7 9 10 *11) [ 3.101926] ACPI: PCI Root Bridge [PCI0] (0000:00) [ 3.102000] PCI: Probing PCI hardware (bus 00) [ 3.103425] 
HPTC: RCBA Base is 0xf0008000 [ 3.103498] HPTC: RCBA 0x3404 is 0x80 [ 3.103566] HPTC: HPTC enabled [ 3.103635] HPTC: HPET located at 0xfed00000 [ 3.103707] PCI quirk: region 1000-107f claimed by ICH6 ACPI/GPIO/TCO [ 3.103779] PCI quirk: region 1180-11bf claimed by ICH6 GPIO [ 3.103989] Boot video device is 0000:01:00.0 [ 3.104433] PCI: Transparent bridge - 0000:00:1e.0 [ 3.104583] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT] [ 3.109110] ACPI: Power Resource [PUBS] (on) [ 3.110023] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.AGP_._PRT] [ 3.110275] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXP0._PRT] [ 3.110438] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXP2._PRT] [ 3.110626] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCI1._PRT] [ 3.112300] Linux Plug and Play Support v0.97 (c) Adam Belay [ 3.112376] pnp: PnP ACPI init [ 3.115896] pnp: PnP ACPI: found 13 devices [ 3.115984] intel_rng: FWH not detected [ 3.116140] SCSI subsystem initialized [ 3.116225] libata version 2.00 loaded. [ 3.116258] usbcore: registered new interface driver usbfs [ 3.116349] usbcore: registered new interface driver hub [ 3.116441] usbcore: registered new device driver usb [ 3.116549] PCI: Using ACPI for IRQ routing [ 3.116621] PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report [ 3.215491] Bluetooth: Core ver 2.11 [ 3.215587] NET: Registered protocol family 31 [ 3.215657] Bluetooth: HCI device and connection manager initialized [ 3.215728] Bluetooth: HCI socket layer initialized [ 3.216289] ieee1394: Initialized config rom entry `ip1394' [ 3.216345] PCI: Bridge: 0000:00:01.0 [ 3.216417] IO window: 3000-3fff [ 3.216487] MEM window: b0100000-b01fffff [ 3.216557] PREFETCH window: c0000000-c7ffffff [ 3.216626] PCI: Bridge: 0000:00:1c.0 [ 3.216694] IO window: disabled. [ 3.216766] MEM window: b0200000-b02fffff [ 3.216835] PREFETCH window: disabled. 
[ 3.216906] PCI: Bridge: 0000:00:1c.2 [ 3.216976] IO window: 4000-4fff [ 3.217047] MEM window: b2000000-b3ffffff [ 3.217117] PREFETCH window: c8000000-c80fffff [ 3.217190] PCI: Bus 12, cardbus bridge: 0000:0b:00.0 [ 3.217260] IO window: 00005000-000050ff [ 3.217331] IO window: 00005400-000054ff [ 3.217403] PREFETCH window: d0000000-d3ffffff [ 3.217474] MEM window: b8000000-bbffffff [ 3.217545] PCI: Bridge: 0000:00:1e.0 [ 3.217615] IO window: 5000-8fff [ 3.217686] MEM window: b4000000-bfffffff [ 3.217757] PREFETCH window: d0000000-d7ffffff [ 3.217834] ACPI: PCI Interrupt 0000:00:01.0[A] -> GSI 16 (level, low) -> IRQ 16 [ 3.217973] PCI: Setting latency timer of device 0000:00:01.0 to 64 [ 3.217986] ACPI: PCI Interrupt 0000:00:1c.0[A] -> GSI 20 (level, low) -> IRQ 17 [ 3.218125] PCI: Setting latency timer of device 0000:00:1c.0 to 64 [ 3.218140] ACPI: PCI Interrupt 0000:00:1c.2[C] -> GSI 22 (level, low) -> IRQ 18 [ 3.218278] PCI: Setting latency timer of device 0000:00:1c.2 to 64 [ 3.218287] PCI: Setting latency timer of device 0000:00:1e.0 to 64 [ 3.218298] ACPI: PCI Interrupt 0000:0b:00.0[A] -> GSI 16 (level, low) -> IRQ 16 [ 3.218448] NET: Registered protocol family 2 [ 3.248830] IP route cache hash table entries: 4096 (order: 2, 16384 bytes) [ 3.248952] TCP established hash table entries: 16384 (order: 4, 65536 bytes) [ 3.249071] TCP bind hash table entries: 8192 (order: 3, 32768 bytes) [ 3.249167] TCP: Hash tables configured (established 16384 bind 8192) [ 3.249238] TCP reno registered [ 3.258925] Simple Boot Flag at 0x35 set to 0x1 [ 3.259018] Machine check exception polling timer started. [ 3.259287] Installing knfsd (copyright (C) 1996 okir@monad.swb.de). [ 3.259478] NTFS driver 2.1.27 [Flags: R/W]. 
[ 3.259606] io scheduler noop registered [ 3.259714] io scheduler anticipatory registered (default) [ 3.259858] io scheduler deadline registered [ 3.259969] io scheduler cfq registered [ 3.261652] PCI: Setting latency timer of device 0000:00:01.0 to 64 [ 3.261667] assign_interrupt_mode Found MSI capability [ 3.261754] Allocate Port Service[0000:00:01.0:pcie00] [ 3.261774] Allocate Port Service[0000:00:01.0:pcie03] [ 3.261816] PCI: Setting latency timer of device 0000:00:1c.0 to 64 [ 3.261852] assign_interrupt_mode Found MSI capability [ 3.261949] Allocate Port Service[0000:00:1c.0:pcie00] [ 3.261967] Allocate Port Service[0000:00:1c.0:pcie02] [ 3.261987] Allocate Port Service[0000:00:1c.0:pcie03] [ 3.262059] PCI: Setting latency timer of device 0000:00:1c.2 to 64 [ 3.262095] assign_interrupt_mode Found MSI capability [ 3.262197] Allocate Port Service[0000:00:1c.2:pcie00] [ 3.262215] Allocate Port Service[0000:00:1c.2:pcie02] [ 3.262233] Allocate Port Service[0000:00:1c.2:pcie03] [ 3.262314] pci_hotplug: PCI Hot Plug PCI Core version: 0.5 [ 3.262387] ibmphpd: IBM Hot Plug PCI Controller Driver version: 0.6 [ 3.262462] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5 [ 3.265962] decode_hpp: Could not get hotplug parameters. 
Use defaults [ 3.266059] acpiphp: Slot [1] registered [ 3.267122] acpiphp_ibm: ibm_find_acpi_device: Failed to get device information<3>acpiphp_ibm: ibm_find_acpi_device: Failed to get device information<3>acpiphp_ibm: ibm_find_acpi_device: Failed to get device information<3>acpiphp_ibm: ibm_acpiphp_init: acpi_walk_namespace failed [ 3.269969] ACPI: AC Adapter [AC] (on-line) [ 3.278158] ACPI: Battery Slot [BAT0] (battery present) [ 3.278284] input: Power Button (FF) as /class/input/input0 [ 3.278358] ACPI: Power Button (FF) [PWRF] [ 3.278459] input: Lid Switch as /class/input/input1 [ 3.278532] ACPI: Lid Switch [LID] [ 3.278632] input: Sleep Button (CM) as /class/input/input2 [ 3.278706] ACPI: Sleep Button (CM) [SLPB] [ 3.278995] ACPI: Video Device [VID] (multi-head: yes rom: no post: no) [ 3.280635] ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3]) [ 3.280857] ACPI: Processor [CPU] (supports 8 throttling states) [ 3.281966] ACPI: Thermal Zone [THM0] (63 C) [ 3.283336] Real Time Clock Driver v1.12ac [ 3.283432] Linux agpgart interface v0.101 (c) Dave Jones [ 3.283522] agpgart: Detected an Intel 915GM Chipset. [ 3.300594] agpgart: AGP aperture is 256M @ 0x0 [ 3.300695] [drm] Initialized drm 1.1.0 20060810 [ 3.300877] tpm_nsc tpm_nscl0: NSC TPM revision 2 [ 3.301002] Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled [ 3.301249] serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a NS16550A [ 3.302001] pnp: Device 00:09 activated. [ 3.302186] 00:09: ttyS0 at I/O 0x3f8 (irq = 4) is a NS16550A [ 3.302352] ACPI: PCI Interrupt 0000:00:1e.3[B] -> GSI 23 (level, low) -> IRQ 19 [ 3.302495] ACPI: PCI interrupt for device 0000:00:1e.3 disabled [ 3.302599] parport: PnPBIOS parport detected. 
[ 3.302704] parport0: PC-style at 0x3bc (0x7bc), irq 7 [PCSPP(,...)] [ 3.303238] loop: loaded (max 8 devices) [ 3.303352] nbd: registered device at major 43 [ 3.303761] Ethernet Channel Bonding Driver: v3.1.1 (September 26, 2006) [ 3.303837] bonding: Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details. [ 3.304025] pcnet32.c:v1.33 27.Jun.2006 tsbogend@alpha.franken.de [ 3.304115] e100: Intel(R) PRO/100 Network Driver, 3.5.17-k2-NAPI [ 3.304185] e100: Copyright(c) 1999-2006 Intel Corporation [ 3.304292] tg3.c:v3.71 (December 15, 2006) [ 3.304377] ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 16 [ 3.304520] PCI: Setting latency timer of device 0000:02:00.0 to 64 [ 0.399999] eth0: Tigon3 [partno(BCM95751M) rev 4101 PHY(5750)] (PCI Express) 10/100/1000Base-T Ethernet 00:0a:e4:c1:27:01 [ 0.399999] eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1] [ 0.399999] eth0: dma_rwctrl[76180000] dma_mask[64-bit] [ 0.399999] PPP generic driver version 2.4.2 [ 0.399999] PPP Deflate Compression module registered [ 0.399999] PPP BSD Compression module registered [ 0.403333] NET: Registered protocol family 24 [ 0.403333] tun: Universal TUN/TAP device driver, 1.6 [ 0.403333] tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com> [ 0.403333] netconsole: not configured, aborting [ 0.403333] ahci 0000:00:1f.2: version 2.0 [ 0.403333] ahci: probe of 0000:00:1f.2 failed with error -12 [ 0.403333] ata_piix 0000:00:1f.2: version 2.00ac7 [ 0.403333] ata_piix 0000:00:1f.2: MAP [ P0 P2 IDE IDE ] [ 0.403333] PCI: Setting latency timer of device 0000:00:1f.2 to 64 [ 0.403333] ata1: SATA max UDMA/133 cmd 0x1F0 ctl 0x3F6 bmdma 0x18C0 irq 14 [ 0.403333] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0x18C8 irq 15 [ 0.403333] scsi0 : ata_piix [ 0.563333] ata1.00: ATA-6, max UDMA/100, 195371568 sectors: LBA [ 0.563333] ata1.00: ata1: 
dev 0 multi count 16 [ 0.563333] ata1.00: applying bridge limits [ 0.573333] ata1.00: configured for UDMA/100 [ 0.573333] scsi1 : ata_piix [ 0.886666] ata2.00: ATAPI, max UDMA/33 [ 1.046666] ata2.00: configured for UDMA/33 [ 1.046666] scsi 0:0:0:0: Direct-Access ATA FUJITSU MHV2100A 0084 PQ: 0 ANSI: 5 [ 1.046666] SCSI device sda: 195371568 512-byte hdwr sectors (100030 MB) [ 1.046666] sda: Write Protect is off [ 1.046666] sda: Mode Sense: 00 3a 00 00 [ 1.046666] SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 1.046666] SCSI device sda: 195371568 512-byte hdwr sectors (100030 MB) [ 1.046666] sda: Write Protect is off [ 1.046666] sda: Mode Sense: 00 3a 00 00 [ 1.046666] SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 1.046666] sda: sda1 sda2 sda3 < sda5 sda6 sda7 > sda4 [ 1.113333] sd 0:0:0:0: Attached scsi disk sda [ 1.113333] sd 0:0:0:0: Attached scsi generic sg0 type 0 [ 1.116666] scsi 1:0:0:0: CD-ROM MATSHITA DVD-RAM UJ-830S 1.02 PQ: 0 ANSI: 5 [ 1.123333] sr0: scsi3-mmc drive: 24x/24x writer dvd-ram cd/rw xa/form2 cdda tray [ 1.123333] Uniform CD-ROM driver Revision: 3.20 [ 1.123333] sr 1:0:0:0: Attached scsi CD-ROM sr0 [ 1.123333] sr 1:0:0:0: Attached scsi generic sg1 type 5 [ 1.123333] ieee1394: raw1394: /dev/raw1394 device initialized [ 1.123333] Yenta: CardBus bridge found at 0000:0b:00.0 [1014:0532] [ 1.249999] Yenta: ISA IRQ mask 0x0438, PCI irq 16 [ 1.249999] Socket status: 30000006 [ 1.249999] pcmcia: parent PCI bridge I/O window: 0x5000 - 0x8fff [ 1.249999] cs: IO port probe 0x5000-0x8fff: clean. 
[ 1.249999] pcmcia: parent PCI bridge Memory window: 0xb4000000 - 0xbfffffff [ 1.249999] pcmcia: parent PCI bridge Memory window: 0xd0000000 - 0xd7ffffff [ 1.503333] usbcore: registered new interface driver usblp [ 1.503333] drivers/usb/class/usblp.c: v0.13: USB Printer Device Class driver [ 1.503333] usbcore: registered new interface driver libusual [ 1.503333] usbcore: registered new interface driver usbhid [ 1.503333] drivers/usb/input/hid-core.c: v2.6:USB HID core driver [ 1.503333] usbcore: registered new interface driver asix [ 1.503333] usbcore: registered new interface driver usbserial [ 1.503333] drivers/usb/serial/usb-serial.c: USB Serial support registered for generic [ 1.503333] usbcore: registered new interface driver usbserial_generic [ 1.503333] drivers/usb/serial/usb-serial.c: USB Serial Driver core [ 1.503333] drivers/usb/serial/usb-serial.c: USB Serial support registered for hp4X [ 1.503333] usbcore: registered new interface driver hp4X [ 1.503333] drivers/usb/serial/hp4x.c: HP4x (48/49) Generic Serial driver v1.00 [ 1.503333] drivers/usb/serial/usb-serial.c: USB Serial support registered for pl2303 [ 1.503333] usbcore: registered new interface driver pl2303 [ 1.503333] drivers/usb/serial/pl2303.c: Prolific PL2303 USB to serial adaptor driver [ 1.503333] PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12 [ 1.509999] serio: i8042 KBD port at 0x60,0x64 irq 1 [ 1.509999] serio: i8042 AUX port at 0x60,0x64 irq 12 [ 1.509999] mice: PS/2 mouse device common for all mice [ 1.513333] input: AT Translated Set 2 keyboard as /class/input/input3 [ 1.519999] i2c /dev entries driver [ 1.519999] ACPI: PCI Interrupt 0000:00:1f.3[A] -> GSI 23 (level, low) -> IRQ 19 [ 1.519999] device-mapper: ioctl: 4.11.0-ioctl (2006-10-12) initialised: dm-devel@redhat.com [ 1.519999] EDAC MC: Ver: 2.0.1 Dec 25 2006 [ 1.549999] Advanced Linux Sound Architecture Driver Version 1.0.14rc1 (Wed Dec 20 08:11:48 2006 UTC). 
[ 1.549999] ACPI: PCI Interrupt 0000:00:1e.2[A] -> GSI 22 (level, low) -> IRQ 18 [ 1.549999] PCI: Setting latency timer of device 0000:00:1e.2 to 64 [ 1.816666] ACPI: EC: evaluating _Q75 [ 2.133333] Synaptics Touchpad, model: 1, fw: 5.9, id: 0x2c6ab1, caps: 0x884793/0x0 [ 2.133333] serio: Synaptics pass-through port at isa0060/serio1/input0 [ 2.176666] input: SynPS/2 Synaptics TouchPad as /class/input/input4 [ 2.473333] intel8x0_measure_ac97_clock: measured 53330 usecs [ 2.473333] intel8x0: clocking to 48000 [ 2.473333] ALSA device list: [ 2.473333] #0: Intel ICH6 with AD1981B at 0xb0000800, irq 18 [ 2.473333] netem: version 1.2 [ 2.473333] u32 classifier [ 2.473333] Netfilter messages via NETLINK v0.30. [ 2.473333] ip_tables: (C) 2000-2006 Netfilter Core Team [ 2.553333] TCP bic registered [ 2.553333] TCP cubic registered [ 2.553333] TCP westwood registered [ 2.553333] TCP highspeed registered [ 2.553333] TCP vegas registered [ 2.553333] NET: Registered protocol family 1 [ 2.553333] NET: Registered protocol family 10 [ 2.553333] IPv6 over IPv4 tunneling driver [ 2.553333] NET: Registered protocol family 17 [ 2.633333] Bridge firewalling registered [ 2.633333] Bluetooth: L2CAP ver 2.8 [ 2.633333] Bluetooth: L2CAP socket layer initialized [ 2.633333] Bluetooth: SCO (Voice Link) ver 0.5 [ 2.633333] Bluetooth: SCO socket layer initialized [ 2.633333] Bluetooth: RFCOMM socket layer initialized [ 2.633333] Bluetooth: RFCOMM TTY layer initialized [ 2.633333] Bluetooth: RFCOMM ver 1.8 [ 2.633333] Bluetooth: BNEP (Ethernet Emulation) ver 1.2 [ 2.633333] Bluetooth: BNEP filters: protocol multicast [ 2.633333] Bluetooth: HIDP (Human Interface Emulation) ver 1.1 [ 2.633333] 802.1Q VLAN Support v1.8 Ben Greear <greearb@candelatech.com> [ 2.633333] All bugs added by David S. 
Miller <davem@redhat.com> [ 2.633333] ieee80211: 802.11 data/management/control stack, git-1.1.13 [ 2.633333] ieee80211: Copyright (C) 2004-2005 Intel Corporation <jketreno@linux.intel.com> [ 2.633333] ieee80211_crypt: registered algorithm 'NULL' [ 2.633333] ieee80211_crypt: registered algorithm 'WEP' [ 2.633333] ieee80211_crypt: registered algorithm 'CCMP' [ 2.633333] ieee80211_crypt: registered algorithm 'TKIP' [ 2.633333] speedstep-centrino with X86_SPEEDSTEP_CENTRINO_ACPIconfig is deprecated. [ 2.633333] Use X86_ACPI_CPUFREQ (acpi-cpufreq instead. [ 2.633333] Using IPI Shortcut mode [ 2.633333] ACPI: (supports S0 S3 S4 S5) [ 2.636666] Time: tsc clocksource has been installed. [ 2.643333] Time: hpet clocksource has been installed. [ 7.439999] IBM TrackPoint firmware: 0x0e, buttons: 3/3 [ 7.696665] input: TPPS/2 IBM TrackPoint as /class/input/input5 [ 7.703332] ACPI: EC: evaluating _Q75 [ 7.879999] kjournald starting. Commit interval 5 seconds [ 7.879999] EXT3-fs: mounted filesystem with ordered data mode. [ 7.879999] VFS: Mounted root (ext3 filesystem) readonly. [ 7.879999] Freeing unused kernel memory: 200k freed [ 11.453332] ACPI: PCI Interrupt 0000:0b:00.1[B] -> GSI 17 (level, low) -> IRQ 20 [ 11.506665] ohci1394: fw-host0: OHCI-1394 1.0 (PCI): IRQ=[20] MMIO=[b1000000-b10007ff] Max Packet=[2048] IR/IT contexts=[4/4] [ 11.516665] eth1394: eth0: IEEE-1394 IPv4 over 1394 Ethernet (fw-host0) [ 11.679998] cs: IO port probe 0x100-0x4ff: excluding 0x4d0-0x4d7 [ 11.683332] cs: IO port probe 0x800-0x8ff: clean. [ 11.683332] cs: IO port probe 0xc00-0xcff: clean. [ 11.683332] cs: IO port probe 0xa00-0xaff: clean. [ 12.166665] Adding 1958000k swap on /dev/sda5. 
Priority:10 extents:1 across:1958000k [ 12.319998] EXT3 FS on sda6, internal journal [ 12.693332] ibm_acpi: ThinkPad EC firmware 76HT16WW-1.06 [ 12.693332] ibm_acpi: IBM ThinkPad ACPI Extras v0.13 [ 12.693332] ibm_acpi: http://ibm-acpi.sf.net/ [ 12.699998] ibm_acpi: fan_init: initial fan status is unknown, assuming it is in auto mode [ 12.783332] ieee1394: Host added: ID:BUS[0-00:1023] GUID[000ae405314003e1] [ 13.063332] kjournald starting. Commit interval 5 seconds [ 13.063332] EXT3-fs: mounted filesystem with ordered data mode. [ 13.066665] kjournald starting. Commit interval 5 seconds [ 13.066665] EXT3 FS on sda7, internal journal [ 13.066665] EXT3-fs: mounted filesystem with ordered data mode. [ 13.493331] pcmcia: Detected deprecated PCMCIA ioctl usage from process: discover. [ 13.493331] pcmcia: This interface will soon be removed from the kernel; please expect breakage unless you upgrade to new tools. [ 13.493331] pcmcia: see http://www.kernel.org/pub/linux/utils/kernel/pcmcia/pcmcia.html for details. [ 15.979998] ieee1394: Node removed: ID:BUS[0-00:1023] GUID[000ae405314003e1] -- Tobias PGP: http://9ac7e0bc.uguu.de ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-26 16:17 ` Tobias Diedrich
@ 2006-12-27 4:55 ` David Miller
  2006-12-27 7:00 ` Linus Torvalds
  2006-12-28 0:16 ` Linus Torvalds
  0 siblings, 2 replies; 311+ messages in thread
From: David Miller @ 2006-12-27 4:55 UTC (permalink / raw)
To: ranma
Cc: torvalds, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa, akpm, hugh, nickpiggin, arjan, linux-kernel

From: Tobias Diedrich <ranma@tdiedrich.de>
Date: Tue, 26 Dec 2006 17:17:00 +0100

> Linus Torvalds wrote:
> > I don't think it's a page table issue any more, it just doesn't look
> > likely with the ARM UP corruption. It's also not apparently even on a
> > cacheline boundary, so it probably is really a dirty bit that got cleared
> > wrong due to some race with IO.
>
> So, until now it's only been reported for SMP on i386?
> I'm seeing the issue on my Pentium-M Notebook (Thinkpad R52) over
> here, UP kernel, no preempt.

I've seen it on sparc64, UP kernel, no preempt.

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-27 4:55 ` [PATCH] mm: fix page_mkclean_one David Miller
@ 2006-12-27 7:00 ` Linus Torvalds
  2006-12-27 8:39 ` Andrei Popa
  2006-12-28 0:16 ` Linus Torvalds
  1 sibling, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-27 7:00 UTC (permalink / raw)
To: David Miller
Cc: ranma, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa, akpm, hugh, nickpiggin, arjan, linux-kernel

On Tue, 26 Dec 2006, David Miller wrote:
>
> I've seen it on sparc64, UP kernel, no preempt.

Btw, having tried to debug the writeback code, there's one very special
case that just makes me go "hmm".

If we have a buffer that is "busy" when we try to write back a page, we
have this magic "wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking" mode,
where we won't wait for it, but instead we'll redirty the page and redo
the whole thing.

Looking at the code, that should all work, but at the same time, it
triggers some of my debug messages about having a dirty page during
writeback, and one way to trigger that debug message is to try to run
rtorrent on the machine..

I dunno. With the writeback being suspicious, and the normal
"block_write_full_page()" path being implicated in at least ext2, I just
wonder. This is one of those "let's see if behaviour changes" patches,
that I'm just throwing out there..

		Linus

---
diff --git a/fs/buffer.c b/fs/buffer.c
index 263f88e..4652ef1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1653,19 +1653,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	do {
 		if (!buffer_mapped(bh))
 			continue;
-		/*
-		 * If it's a fully non-blocking write attempt and we cannot
-		 * lock the buffer then redirty the page. Note that this can
-		 * potentially cause a busy-wait loop from pdflush and kswapd
-		 * activity, but those code paths have their own higher-level
-		 * throttling.
-		 */
-		if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
-			lock_buffer(bh);
-		} else if (test_set_buffer_locked(bh)) {
-			redirty_page_for_writepage(wbc, page);
-			continue;
-		}
+		lock_buffer(bh);
 		if (test_clear_buffer_dirty(bh)) {
 			mark_buffer_async_write(bh);
 		} else {

^ permalink raw reply related [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-27 7:00 ` Linus Torvalds
@ 2006-12-27 8:39 ` Andrei Popa
  0 siblings, 0 replies; 311+ messages in thread
From: Andrei Popa @ 2006-12-27 8:39 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Miller, ranma, gordonfarquharson, tbm, a.p.zijlstra, akpm, hugh, nickpiggin, arjan, linux-kernel

I have corrupted files...

> ---
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 263f88e..4652ef1 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1653,19 +1653,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
>  	do {
>  		if (!buffer_mapped(bh))
>  			continue;
> -		/*
> -		 * If it's a fully non-blocking write attempt and we cannot
> -		 * lock the buffer then redirty the page. Note that this can
> -		 * potentially cause a busy-wait loop from pdflush and kswapd
> -		 * activity, but those code paths have their own higher-level
> -		 * throttling.
> -		 */
> -		if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
> -			lock_buffer(bh);
> -		} else if (test_set_buffer_locked(bh)) {
> -			redirty_page_for_writepage(wbc, page);
> -			continue;
> -		}
> +		lock_buffer(bh);
>  		if (test_clear_buffer_dirty(bh)) {
>  			mark_buffer_async_write(bh);
>  		} else {

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-27 4:55 ` [PATCH] mm: fix page_mkclean_one David Miller
  2006-12-27 7:00 ` Linus Torvalds
@ 2006-12-28 0:16 ` Linus Torvalds
  2006-12-28 0:39 ` Linus Torvalds
  1 sibling, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-28 0:16 UTC (permalink / raw)
To: David Miller
Cc: ranma, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List

On Tue, 26 Dec 2006, David Miller wrote:
>
> I've seen it on sparc64, UP kernel, no preempt.

Ok, I still don't have a clue, but I think I at least have a new
test-case. It can probably be improved upon, but this would _seem_ to
trigger the problem. Can people check?

You'd want to make sure you get page-out activity, by making TARGETSIZE be
big enough to cause memory pressure (and rather than making it bigger, you
might want to make your memory smaller instead, to make it run more
quickly. Either using "mem=128M" or big compiles or something...).

If it finds corruption, you'll see something like

	Writing chunk 183858/183859 (99%)
	Chunk ..
	Chunk 120887 corrupted
	Chunk 122372 corrupted
	Chunk ...
	Checking chunk 183858/183859 (99%)

otherwise it will just say

	Writing chunk 183858/183859 (99%)
	Checking chunk 183858/183859 (99%)

and exit.

I didn't spend a lot of time verifying this, but I _was_ able to cause
those "Chunk xxx corrupted" messages with this. There's probably a more
efficient better way to do it, but this is better than trying to use
rtorrent, and also makes any worries about what rtorrent does go away.

Of course, maybe it's this test-program that is buggy now, although it
looks trivial enough that I don't think it is.

I think my earlier stress-tester may not have triggered this, because it
just did all its writing in a linear order, so any LRU logic will happen
to write back old pages that we are no longer touching. The randomization
(and using a chunksize that isn't a multiple of a page-size) makes sure
that we're actually going to have lots of rewriting going on.

I think the test-case could probably be improved by having a munmap() and
page-cache flush in between the writing and the checking, to see whether
that shows the corruption easier (and possibly without having to start
paging in order to throw the pages out, which would simplify testing a
lot).

But I haven't tested. I decided to post this asap, now that I've recreated
the corruption with something else, and something that is possibly easier
to analyze..

		Linus

----
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (256 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}

static void checkmem(void *start, int nr)
{
	unsigned char c = nr, *p = start;
	int i;
	for (i = 0; i < CHUNKSIZE; i++) {
		if (*p++ != c) {
			printf("Chunk %d corrupted \n", nr);
			return;
		}
	}
}

int main(int argc, char **argv)
{
	char *mapping;
	int fd, i;
	static int chunkorder[NRCHUNKS];

	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;

	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (-1 == (int)(long)mapping)
		return -1;

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	return 0;
}

^ permalink raw reply [flat|nested] 311+ messages in thread
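The remark that a 1460-byte chunk size is not a multiple of the page size can be made concrete: every page of the file is shared by several chunks, so with a random write order each page gets dirtied again at several unrelated points in time. A minimal sketch of that count, assuming 4096-byte pages; the helper name chunks_touching_page() and the PAGE_BYTES constant are made up for this illustration and are not part of the posted program (it also ignores the short final page at the end of the file):

```c
#include <assert.h>

#define CHUNKSIZE  1460
#define PAGE_BYTES 4096	/* assumption: 4 KiB pages */

/* Number of distinct chunks that overlap page number 'page' of the file. */
static unsigned long chunks_touching_page(unsigned long page)
{
	unsigned long first_byte = page * PAGE_BYTES;
	unsigned long last_byte = first_byte + PAGE_BYTES - 1;
	unsigned long first_chunk = first_byte / CHUNKSIZE;
	unsigned long last_chunk = last_byte / CHUNKSIZE;

	assert(last_chunk >= first_chunk);
	return last_chunk - first_chunk + 1;
}
```

Page 0 is covered by chunks 0, 1 and 2, while page 1 is covered by chunks 2 through 5, so a page is typically written three or four separate times during the randomized fill.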
* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28 0:16 ` Linus Torvalds
@ 2006-12-28 0:39 ` Linus Torvalds
  2006-12-28 0:52 ` David Miller
  0 siblings, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-28 0:39 UTC (permalink / raw)
To: David Miller
Cc: ranma, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List

On Wed, 27 Dec 2006, Linus Torvalds wrote:
>
> I think the test-case could probably be improved by having a munmap() and
> page-cache flush in between the writing and the checking, to see whether
> that shows the corruption easier (and possibly without having to start
> paging in order to throw the pages out, which would simplify testing a
> lot).

I think the page-writeout is implicated, because I do seem to need it, but
the page-cache flush does seem to make corruption _easier_ to see. I now
seem able to trigger it with a 100MB file on a 256MB machine in a minute
or so, with this slight modification.

I still don't see _why_, though. But maybe smarter people than me can see
it..

		Linus

---
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (100 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}

static void checkmem(void *start, int nr)
{
	unsigned char c = nr, *p = start;
	int i;
	for (i = 0; i < CHUNKSIZE; i++) {
		if (*p++ != c) {
			printf("Chunk %d corrupted \n", nr);
			return;
		}
	}
}

static char *remap(int fd, char *mapping)
{
	if (mapping) {
		munmap(mapping, SIZE);
		posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
	}
	return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(int argc, char **argv)
{
	char *mapping;
	int fd, i;
	static int chunkorder[NRCHUNKS];

	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;

	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = remap(fd, NULL);
	if (-1 == (int)(long)mapping)
		return -1;

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Unmap, drop, and remap.. */
	mapping = remap(fd, mapping);

	/* .. and check */
	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	return 0;
}

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one
  2006-12-28 0:39 ` Linus Torvalds
@ 2006-12-28 0:52 ` David Miller
  2006-12-28 3:04 ` Linus Torvalds
  0 siblings, 1 reply; 311+ messages in thread
From: David Miller @ 2006-12-28 0:52 UTC (permalink / raw)
To: torvalds
Cc: ranma, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa, akpm, hugh, nickpiggin, arjan, linux-kernel

From: Linus Torvalds <torvalds@osdl.org>
Date: Wed, 27 Dec 2006 16:39:43 -0800 (PST)

>
>
> On Wed, 27 Dec 2006, Linus Torvalds wrote:
> >
> > I think the test-case could probably be improved by having a munmap() and
> > page-cache flush in between the writing and the checking, to see whether
> > that shows the corruption easier (and possibly without having to start
> > paging in order to throw the pages out, which would simplify testing a
> > lot).
>
> I think the page-writeout is implicated, because I do seem to need it, but
> the page-cache flush does seem to make corruption _easier_ to see. I now
> seem able to trigger it with a 100MB file on a 256MB machine in a minute
> or so, with this slight modification.
>
> I still don't see _why_, though. But maybe smarter people than me can see
> it..

FWIW this program definitely triggers the bug for me.

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 0:52 ` David Miller @ 2006-12-28 3:04 ` Linus Torvalds 2006-12-28 4:32 ` Gordon Farquharson ` (4 more replies) 0 siblings, 5 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-28 3:04 UTC (permalink / raw) To: David Miller Cc: ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List On Wed, 27 Dec 2006, David Miller wrote: > > > > I still don't see _why_, though. But maybe smarter people than me can see > > it.. > > FWIW this program definitely triggers the bug for me. Ok, now that I have something simple to do repeatable stuff with, I can say what the pattern is.. It's not all that surprising, but it's still worth just stating for the record. What happens is that when I do the "packetized writes" in random order, the _last_ write to a page occasionally just goes missing. It's not always at the end of a page, as shown by for example: - A whole chunk got dropped: Chunk 2094 corrupted (0-1459) (1624-3083) Expected 46, got 0 Written as (30912)55414(10000) That "Written as (x)y(z)" line means that the corrupted chunk was written as chunk #y, and the preceding and following chunks (that were _not_ corrupt) on the page was written as #x and #z respectively. In other words, the missing chunk (which is still zero) was written much later than the ones that were ok, and never hit the disk. It's a contiguous chunk in the middle of the page (chunks are 1460 bytes in size) The first line means that all bytes of the chunk (0-1459) were corrupted, and the values in parenthesis are the offsets within a page. In other words, this was a chunk in the _middle_ of a page. 
- The missing data can also be at the beginning or ends of pages: Beginning of the chunk missing, it was at the end of a page (page offsets 3288-4095) and the _next_ page got written out fine: Chunk 2126 corrupted (0-807) (3288-4095) Expected 78, got 0 Written as (32713)55573(14301) End of a chunk missing, it was the beginning of a page (and the _previous_ page that contained the beginning of the chunk was written out fine) Chunk 2179 corrupted (1252-1459) (0-207) Expected 131, got 0 Written as (45189)55489(15515) Now, the reason I say this isn't surprising is that this is entirely consistent with the dirty bit being dropped on the floor somewhere, and likely through some interaction with the previous changes being in the process of being written out. Something (incorrectly) ends up deciding that it doesn't need to write the page, since it's already written, or alternatively clears the dirty bit too late (clears it because an _earlier_ write finished, never mind that the new dirty data didn't make it). I also figured out that it's not the low-memory situation that does it, it really must be the "page_mkclean()" triggering. Becuase I can do echo 5 > /proc/sys/vm/dirty_ratio echo 3 > /proc/sys/vm/dirty_background_ratio to make it clean the pages much more aggressively than the default, and I can see corruption on my 256MB machine with just a 40MB shared file, and 70MB of memory consistently free. So this thing is definitely giving some answers. It's NOT about low memory, and it very much seems to be about the whole "balance_dirty_ratio" thing. I don't think I triggered the actual low-memory stuff once in that situation.. So I have some more data on the behaviour, but I _still_ don't see the reason behind it. It's probably something really obvious once it's pointed out.. [ Modified test-program that tells you where the corruption happens (and when the missing parts were supposed to be written out) appended, in case people care. 
]

Linus

---
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (100 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}

#define page_offset(buf, off) (0xfff & ((unsigned)(unsigned long)(buf)+(off)))

static int chunkorder[NRCHUNKS];

static int order(int nr)
{
	int i;
	if (nr < 0 || nr >= NRCHUNKS)
		return -1;
	for (i = 0; i < NRCHUNKS; i++)
		if (chunkorder[i] == nr)
			return i;
	return -2;
}

static void checkmem(void *buf, int nr)
{
	unsigned int start = ~0u, end = 0;
	unsigned char c = nr, *p = buf, differs = 0;
	int i;

	for (i = 0; i < CHUNKSIZE; i++) {
		unsigned char got = *p++;
		if (got != c) {
			if (i < start)
				start = i;
			if (i > end)
				end = i;
			differs = got;
		}
	}
	if (start < end) {
		printf("Chunk %d corrupted (%u-%u) (%u-%u) \n", nr, start, end,
			page_offset(buf, start), page_offset(buf, end));
		printf("Expected %u, got %u\n", c, differs);
		printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), order(nr+1));
	}
}

static char *remap(int fd, char *mapping)
{
	if (mapping) {
		munmap(mapping, SIZE);
		posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
	}

	return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(int argc, char **argv)
{
	char *mapping;
	int fd, i;

	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;

	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = remap(fd, NULL);
	if (-1 == (int)(long)mapping)
		return -1;

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Unmap, drop, and remap.. */
	mapping = remap(fd, mapping);

	/* .. and check */
	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	return 0;
}

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 3:04 ` Linus Torvalds @ 2006-12-28 4:32 ` Gordon Farquharson 2006-12-28 4:53 ` Linus Torvalds 2006-12-28 5:55 ` Chen, Kenneth W ` (3 subsequent siblings) 4 siblings, 1 reply; 311+ messages in thread From: Gordon Farquharson @ 2006-12-28 4:32 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List On 12/27/06, Linus Torvalds <torvalds@osdl.org> wrote: > [ Modified test-program that tells you where the corruption happens (and > when the missing parts were supposed to be written out) appended, in > case people care. ] For the record, this is the output from a run on our ARM machine (32 MB RAM) with 2.6.18 + the following patches: mm: tracking shared dirty pages mm: balance dirty pages mm: optimize the new mprotect() code a bit mm: small cleanup of install_page() mm: fixup do_wp_page() mm: msync() cleanup Is it at all surprising that the second offset within a page can be less than the first offset within a page ? e.g. 
Chunk 260 corrupted (1-1455) (2769-127) $ ./linus-test Writing chunk 279/280 (99%) Chunk 256 corrupted (1-1455) (1025-2479) Expected 0, got 1 Written as (82)175(56) Chunk 258 corrupted (1-1455) (3945-1303) Expected 2, got 3 Written as (56)51(20) Chunk 260 corrupted (1-1455) (2769-127) Expected 4, got 5 Written as (20)30(18) Chunk 262 corrupted (1-1455) (1593-3047) Expected 6, got 7 Written as (18)196(158) Chunk 264 corrupted (1-1455) (417-1871) Expected 8, got 9 Written as (158)133(146) Chunk 266 corrupted (1-1455) (3337-695) Expected 10, got 11 Written as (146)43(77) Chunk 268 corrupted (1-1455) (2161-3615) Expected 12, got 13 Written as (77)251(211) Chunk 270 corrupted (1-1455) (985-2439) Expected 14, got 15 Written as (211)257(231) Chunk 272 corrupted (1-1455) (3905-1263) Expected 16, got 17 Written as (231)254(154) Chunk 274 corrupted (1-1455) (2729-87) Expected 18, got 19 Written as (154)11(85) Chunk 276 corrupted (1-1455) (1553-3007) Expected 20, got 21 Written as (85)230(134) Chunk 278 corrupted (1-1455) (377-1831) Expected 22, got 23 Written as (134)233(103) Checking chunk 279/280 (99%) Gordon -- Gordon Farquharson ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 4:32 ` Gordon Farquharson @ 2006-12-28 4:53 ` Linus Torvalds 2006-12-28 5:20 ` Gordon Farquharson [not found] ` <97a0a9ac0612272115g4cce1f08n3c3c8498a6076bd5@mail.gmail.com> 0 siblings, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-28 4:53 UTC (permalink / raw) To: Gordon Farquharson Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List On Wed, 27 Dec 2006, Gordon Farquharson wrote: > > Is it at all surprising that the second offset within a page can be > less than the first offset within a page ? e.g. > > Chunk 260 corrupted (1-1455) (2769-127) No, that just means that it went over to the next page (so you actually had two consecutive pages that weren't written out). That said, your output is very different from mine in another way. You don't have zeroes in your pages, rather the thing seems to have data from the next block (ie the chunk that should have 20 is reported as having 21 etc). You also have your offsets shifted up by one (ie offset 0 looks ok for you, and then you have a strange pattern of corruption at bytes 1...1455 instead of 0..1459). You also seem to have an example of the _earlier_ writes being corrupted, rather than the later ones. For example (but it's also a page-crosser, so maybe that's part of it): Chunk 274 corrupted (1-1455) (2729-87) Expected 18, got 19 Written as (154)11(85) says that block chunk 274 is the corrupt one, but it was written fairly early as #11, and the blocks around it (chunks 273 and 275) were actually written later. For all I know, my test-program is buggy wrt the ordering printouts, though. Did you perhaps change the logic in any way? Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 4:53 ` Linus Torvalds @ 2006-12-28 5:20 ` Gordon Farquharson 2006-12-28 5:41 ` David Miller 2006-12-28 10:13 ` Russell King [not found] ` <97a0a9ac0612272115g4cce1f08n3c3c8498a6076bd5@mail.gmail.com> 1 sibling, 2 replies; 311+ messages in thread From: Gordon Farquharson @ 2006-12-28 5:20 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List [Oops - forgot to hit "Reply to All" first time round.] Hi Linus On 12/27/06, Linus Torvalds <torvalds@osdl.org> wrote: > For all I know, my test-program is buggy wrt the ordering printouts, > though. Did you perhaps change the logic in any way? I don't think so. I did reduce the target size #define TARGETSIZE (100 << 12) to make the program finish a little quicker, and for some reason I get linus-test.c: In function 'remap': linus-test.c:61: error: 'POSIX_FADV_DONTNEED' undeclared (first use in this function) when I compile the program, so I replaced POSIX_FADV_DONTNEED with 4 as defined in /usr/include/bits/fcntl.h. Other than these two changes, the program is identical to the version you posted. I have run the program a few times, and the output is pretty consistent. However, when I increase the target size, the difference between the expected and actual values is larger. 
Written as (749)935(738)
Chunk 1113 corrupted (1-1455) (2965-323)
Expected 89, got 93
Written as (935)738(538)
Chunk 1114 corrupted (1-1455) (329-1783)
Expected 90, got 94
Written as (738)538(678)
Chunk 1115 corrupted (1-1455) (1789-3243)
Expected 91, got 95
Written as (538)678(989)
Chunk 1120 corrupted (1-1455) (897-2351)
Expected 96, got 100
Written as (537)265(1005)
Chunk 1121 corrupted (1-1455) (2357-3811)
Expected 97, got 101
Written as (265)1005(-1)

--- linus-test.c.orig	2006-12-28 06:17:24.000000000 +0100
+++ linus-test.c	2006-12-28 06:18:24.000000000 +0100
@@ -6,7 +6,7 @@
 #include <stdio.h>
 #include <time.h>

-#define TARGETSIZE (100 << 20)
+#define TARGETSIZE (100 << 14)
 #define CHUNKSIZE (1460)
 #define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
 #define SIZE (NRCHUNKS * CHUNKSIZE)
@@ -61,7 +61,7 @@
 {
 	if (mapping) {
 		munmap(mapping, SIZE);
-		posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
+		posix_fadvise(fd, 0, SIZE, 4);
 	}

 	return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

Gordon

-- Gordon Farquharson ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 5:20 ` Gordon Farquharson @ 2006-12-28 5:41 ` David Miller 2006-12-28 5:47 ` Gordon Farquharson 2006-12-28 10:13 ` Russell King 1 sibling, 1 reply; 311+ messages in thread From: David Miller @ 2006-12-28 5:41 UTC (permalink / raw) To: gordonfarquharson Cc: torvalds, ranma, tbm, a.p.zijlstra, andrei.popa, akpm, hugh, nickpiggin, arjan, linux-kernel From: "Gordon Farquharson" <gordonfarquharson@gmail.com> Date: Wed, 27 Dec 2006 22:20:20 -0700 > and for some reason I get > > linus-test.c: In function 'remap': > linus-test.c:61: error: 'POSIX_FADV_DONTNEED' undeclared (first use in > this function) > > when I compile the program, so I replaced POSIX_FADV_DONTNEED with 4 > as defined in /usr/include/bits/fcntl.h. Me too, I added "-D_POSIX_C_SOURCE=200112" to "fix" this. Perhaps Linus's GCC sets that by default and ours doesn't. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 5:41 ` David Miller @ 2006-12-28 5:47 ` Gordon Farquharson 0 siblings, 0 replies; 311+ messages in thread From: Gordon Farquharson @ 2006-12-28 5:47 UTC (permalink / raw) To: David Miller Cc: torvalds, ranma, tbm, a.p.zijlstra, andrei.popa, akpm, hugh, nickpiggin, arjan, linux-kernel Hi David On 12/27/06, David Miller <davem@davemloft.net> wrote: > Me too, I added "-D_POSIX_C_SOURCE=200112" to "fix" this. That works for me. Thanks for the tip. Gordon -- Gordon Farquharson ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 5:20 ` Gordon Farquharson 2006-12-28 5:41 ` David Miller @ 2006-12-28 10:13 ` Russell King 2006-12-28 14:15 ` Gordon Farquharson 2006-12-28 17:27 ` Linus Torvalds 1 sibling, 2 replies; 311+ messages in thread From: Russell King @ 2006-12-28 10:13 UTC (permalink / raw) To: Gordon Farquharson Cc: Linus Torvalds, David Miller, ranma, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List On Wed, Dec 27, 2006 at 10:20:20PM -0700, Gordon Farquharson wrote: > I have run the program a few times, and the output is pretty > consistent. However, when I increase the target size, the difference > between the expected and actual values is larger. > > Written as (749)935(738) > Chunk 1113 corrupted (1-1455) (2965-323) > Expected 89, got 93 This is not the corruption Linus is after. Note that the corruption starts at offset '1'. Also note that: 89 = 1113 & 255 93 = 1113 & 255 | (1113 >> 8) and if you look at glibc's memset() function, you'll notice that's exactly what you expect if you pass a non-8bit value to it. Ergo, what you're seeing is utterly expected given glibc's memset() implementation on ARM. Fixing Linus' test program to pass nr & 255 to memset results in clean passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM machines (as would be expected.) -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 10:13 ` Russell King @ 2006-12-28 14:15 ` Gordon Farquharson 2006-12-28 15:53 ` Martin Michlmayr 2006-12-28 17:27 ` Linus Torvalds 1 sibling, 1 reply; 311+ messages in thread From: Gordon Farquharson @ 2006-12-28 14:15 UTC (permalink / raw) To: Gordon Farquharson, Linus Torvalds, David Miller, ranma, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List On 12/28/06, Russell King <rmk+lkml@arm.linux.org.uk> wrote: > Fixing Linus' test program to pass nr & 255 to memset results in clean > passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM > machines (as would be expected.) Thanks for the fix, Russell. I can now trigger the (real) problem by using a 25 MB file (100 << 18) and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM). $ ./linus-test Writing chunk 17954/17955 (99%) Chunk 514 corrupted (0-1459) (872-2331) Expected 2, got 0 Written as (8479)11160(10312) Chunk 516 corrupted (0-303) (3792-4095) Expected 4, got 0 Written as (10312)10569(4426) Chunk 959 corrupted (0-691) (3404-4095) Expected 191, got 0 Written as (687)4881(1522) Chunk 1895 corrupted (0-1459) (1900-3359) Expected 103, got 0 Written as (7746)8389(6231) Chunk 2702 corrupted (0-1459) (472-1931) Expected 142, got 0 Written as (4866)7103(2409) Chunk 3314 corrupted (0-1459) (1064-2523) Expected 242, got 0 Written as (4287)7064(1730) Chunk 4043 corrupted (0-1459) (444-1903) Expected 203, got 0 Written as (6495)8509(4464) Chunk 5180 corrupted (0-1459) (1584-3043) Expected 60, got 0 Written as (11056)12826(10797) Chunk 5672 corrupted (0-991) (3104-4095) Expected 40, got 0 Written as (9944)4872(41) Chunk 5793 corrupted (460-1459) (0-999) Expected 161, got 0 Written as (7059)5038(4377) Chunk 6089 corrupted (0-1459) (1620-3079) Expected 201, got 0 Written as (4672)5230(4403) Chunk 6545 corrupted (268-1459) (0-1191) Expected 145, got 0 Written as (3701)5969(4668) Chunk 7578 corrupted (0-1459) 
(584-2043) Expected 154, got 0 Written as (10015)5082(1648) Chunk 7880 corrupted (864-1459) (0-595) Expected 200, got 0 Written as (17869)5064(4745) Chunk 8086 corrupted (0-1459) (888-2347) Expected 150, got 0 Written as (10206)11050(10374) Chunk 8749 corrupted (0-1459) (2212-3671) Expected 45, got 0 Written as (15263)7132(4825) Chunk 9068 corrupted (0-1459) (1008-2467) Expected 108, got 0 Written as (5557)7571(6771) Chunk 9193 corrupted (812-1459) (0-647) Expected 233, got 0 Written as (9238)7277(4757) Chunk 10032 corrupted (576-1459) (0-883) Expected 48, got 0 Written as (15741)10012(1753) Chunk 10056 corrupted (0-1459) (1696-3155) Expected 72, got 0 Written as (5379)7431(262) Chunk 10395 corrupted (0-1459) (1020-2479) Expected 155, got 0 Written as (21)7442(5902) Chunk 10791 corrupted (0-1459) (1644-3103) Expected 39, got 0 Written as (4753)5925(5926) Chunk 10792 corrupted (0-991) (3104-4095) Expected 40, got 0 Written as (5925)5926(8555) Chunk 11036 corrupted (0-1103) (2992-4095) Expected 28, got 0 Written as (13755)14449(7458) Chunk 11387 corrupted (644-1459) (0-815) Expected 123, got 0 Written as (10853)11459(9445) Chunk 11586 corrupted (920-1459) (0-539) Expected 66, got 0 Written as (3769)11691(11123) Chunk 11882 corrupted (0-1459) (1160-2619) Expected 106, got 0 Written as (10736)11696(2788) Chunk 12397 corrupted (0-603) (3492-4095) Expected 109, got 0 Written as (2352)7515(2437) Chunk 12669 corrupted (0-795) (3300-4095) Expected 125, got 0 Written as (1191)7661(5266) Chunk 13162 corrupted (0-1459) (2184-3643) Expected 106, got 0 Written as (9383)13662(11544) Chunk 14653 corrupted (0-27) (4068-4095) Expected 61, got 0 Written as (8100)9456(1275) Chunk 17332 corrupted (0-367) (3728-4095) Expected 180, got 0 Written as (760)12247(1244) Chunk 17445 corrupted (0-1459) (772-2231) Expected 37, got 0 Written as (8007)16481(14439) Chunk 17556 corrupted (0-1007) (3088-4095) Expected 148, got 0 Written as (10113)10657(10477) Chunk 17859 corrupted (0-995) (3100-4095) 
Expected 195, got 0 Written as (14472)14767(11426) Checking chunk 17954/17955 (99%) Gordon -- Gordon Farquharson ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 14:15 ` Gordon Farquharson @ 2006-12-28 15:53 ` Martin Michlmayr 0 siblings, 0 replies; 311+ messages in thread From: Martin Michlmayr @ 2006-12-28 15:53 UTC (permalink / raw) To: Gordon Farquharson Cc: Linus Torvalds, David Miller, ranma, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List * Gordon Farquharson <gordonfarquharson@gmail.com> [2006-12-28 07:15]: > Thanks for the fix, Russell. > > I can now trigger the (real) problem by using a 25 MB file (100 << 18) > and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM). Me too (using 100 << 18). Interestingly, I don't seem to get any corruption on a different ARM board, an IOP32x based machine with 128 MB RAM. -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 10:13 ` Russell King 2006-12-28 14:15 ` Gordon Farquharson @ 2006-12-28 17:27 ` Linus Torvalds 2006-12-28 18:44 ` Russell King 1 sibling, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-28 17:27 UTC (permalink / raw) To: Russell King Cc: Gordon Farquharson, David Miller, ranma, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List On Thu, 28 Dec 2006, Russell King wrote: > > and if you look at glibc's memset() function, you'll notice that's exactly > what you expect if you pass a non-8bit value to it. Ergo, what you're > seeing is utterly expected given glibc's memset() implementation on ARM. Guys, you _really_ should fix memset(). What you describe is a _bug_. "memset()" takes an "int" as its argument (always has), and has to convert it to a byte _itself_. It may not be common, but it's perfectly normal, to pass it values outside 0-255 (negative values that still fit in a "signed char" in particular are very normal, but my usage of "let the thing truncate it itself" is also quite fine). > Fixing Linus' test program to pass nr & 255 to memset No. I'm almost certain that that is not a "fix", it's a workaround for a serious bug in your glibc crap. But it does explain all the unexpected strange behaviour (and the really small writeback size - now it doesn't need any /proc/sys/vm/dirty_ratio assumptions to be explicable). Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 17:27 ` Linus Torvalds @ 2006-12-28 18:44 ` Russell King 2006-12-28 19:01 ` Linus Torvalds 0 siblings, 1 reply; 311+ messages in thread From: Russell King @ 2006-12-28 18:44 UTC (permalink / raw) To: Linus Torvalds Cc: Gordon Farquharson, David Miller, ranma, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List On Thu, Dec 28, 2006 at 09:27:12AM -0800, Linus Torvalds wrote: > On Thu, 28 Dec 2006, Russell King wrote: > > and if you look at glibc's memset() function, you'll notice that's exactly > > what you expect if you pass a non-8bit value to it. Ergo, what you're > > seeing is utterly expected given glibc's memset() implementation on ARM. > > Guys, you _really_ should fix memset(). What you describe is a _bug_. Yup, but I have nothing to do with glibc because I refuse to do that silly copyright assignment FSF thing. Hopefully someone else can resolve it, but... > > Fixing Linus' test program to pass nr & 255 to memset > > No. I'm almost certain that that is not a "fix", it's a workaround for a > serious bug in your glibc crap. _is_ a fix whether _you_ like it or not to work around the issue so people can at least run your test program. I'm not saying it's a proper fix though. Of course, if you prefer to be misled by incorrect bug reports... -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 18:44 ` Russell King @ 2006-12-28 19:01 ` Linus Torvalds 0 siblings, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-28 19:01 UTC (permalink / raw) To: Russell King Cc: Gordon Farquharson, David Miller, ranma, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List On Thu, 28 Dec 2006, Russell King wrote: > > Yup, but I have nothing to do with glibc because I refuse to do that > silly copyright assignment FSF thing. Hopefully someone else can > resolve it, but... Yeah, me too. > _is_ a fix whether _you_ like it or not to work around the issue so > people can at least run your test program. I'm not saying it's a > proper fix though. My point was that it wasn't a "fix", it's a "workaround". The _fix_ would be in glibc. Nothing more.. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
[parent not found: <97a0a9ac0612272115g4cce1f08n3c3c8498a6076bd5@mail.gmail.com>]
[parent not found: <Pine.LNX.4.64.0612272120180.4473@woody.osdl.org>]
* Re: [PATCH] mm: fix page_mkclean_one [not found] ` <Pine.LNX.4.64.0612272120180.4473@woody.osdl.org> @ 2006-12-28 5:38 ` Gordon Farquharson 2006-12-28 9:30 ` Martin Michlmayr 2006-12-28 10:16 ` Martin Michlmayr 2006-12-28 5:58 ` Gordon Farquharson 1 sibling, 2 replies; 311+ messages in thread From: Gordon Farquharson @ 2006-12-28 5:38 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List On 12/27/06, Linus Torvalds <torvalds@osdl.org> wrote: > On Wed, 27 Dec 2006, Gordon Farquharson wrote: > > > > I don't think so. I did reduce the target size > > > > #define TARGETSIZE (100 << 12) > > That's just 400kB! > > There's no way you should see corruption with that kind of value. It > should all stay solidly in the cache. > > Is this perhaps with ARM nommu or something else strange? It may be that > the program just doesn't work at all if mmap() is faked out with a malloc > or similar. Definitely a question for the ARM gurus. I'm out of my depth. Gordon -- Gordon Farquharson ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 5:38 ` Gordon Farquharson @ 2006-12-28 9:30 ` Martin Michlmayr 2006-12-28 10:16 ` Martin Michlmayr 1 sibling, 0 replies; 311+ messages in thread From: Martin Michlmayr @ 2006-12-28 9:30 UTC (permalink / raw) To: Gordon Farquharson Cc: Linus Torvalds, David Miller, ranma, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List * Gordon Farquharson <gordonfarquharson@gmail.com> [2006-12-27 22:38]: > >That's just 400kB! > > > >There's no way you should see corruption with that kind of value. It > >should all stay solidly in the cache. > > > >Is this perhaps with ARM nommu or something else strange? It may be that > >the program just doesn't work at all if mmap() is faked out with a malloc > >or similar. > > Definitely a question for the ARM gurus. I'm out of my depth. The CPU has a MMU. For reference, it's a IXP4xx based device with 32 MB of memory. -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 5:38 ` Gordon Farquharson 2006-12-28 9:30 ` Martin Michlmayr @ 2006-12-28 10:16 ` Martin Michlmayr 2006-12-28 10:49 ` Russell King 1 sibling, 1 reply; 311+ messages in thread From: Martin Michlmayr @ 2006-12-28 10:16 UTC (permalink / raw) To: Gordon Farquharson Cc: Linus Torvalds, David Miller, ranma, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List * Gordon Farquharson <gordonfarquharson@gmail.com> [2006-12-27 22:38]: > >> #define TARGETSIZE (100 << 12) > > > >That's just 400kB! > > > >There's no way you should see corruption with that kind of value. It > >should all stay solidly in the cache. > > > >Is this perhaps with ARM nommu or something else strange? It may be that > >the program just doesn't work at all if mmap() is faked out with a malloc > >or similar. > > Definitely a question for the ARM gurus. I'm out of my depth. By the way, I just tried it with TARGETSIZE (100 << 12) on a different ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of memory) and I see similar results to that from Gordon: Writing chunk 279/280 (99%) Chunk 256 corrupted (1-1455) (1025-2479) Expected 0, got 1 Written as (199)43(184) Chunk 258 corrupted (1-1455) (3945-1303) Expected 2, got 3 Written as (184)74(145) Chunk 260 corrupted (1-1455) (2769-127) Expected 4, got 5 Written as (145)89(237) Chunk 262 corrupted (1-1455) (1593-3047) Expected 6, got 7 Written as (237)168(174) Chunk 264 corrupted (1-1455) (417-1871) Expected 8, got 9 Written as (174)135(161) Chunk 266 corrupted (1-1455) (3337-695) Expected 10, got 11 Written as (161)123(180) Chunk 268 corrupted (1-1455) (2161-3615) Expected 12, got 13 Written as (180)13(19) Chunk 270 corrupted (1-1455) (985-2439) Expected 14, got 15 Written as (19)172(106) Chunk 272 corrupted (1-1455) (3905-1263) Expected 16, got 17 Written as (106)212(140) Chunk 274 corrupted (1-1455) (2729-87) Expected 18, got 19 Written as (140)124(73) Chunk 
276 corrupted (1-1455) (1553-3007) Expected 20, got 21 Written as (73)151(52) Chunk 278 corrupted (1-1455) (377-1831) Expected 22, got 23 Written as (52)215(99) Checking chunk 279/280 (99%) -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 10:16 ` Martin Michlmayr @ 2006-12-28 10:49 ` Russell King 2006-12-28 14:56 ` Martin Michlmayr 0 siblings, 1 reply; 311+ messages in thread From: Russell King @ 2006-12-28 10:49 UTC (permalink / raw) To: Martin Michlmayr Cc: Gordon Farquharson, Linus Torvalds, David Miller, ranma, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List On Thu, Dec 28, 2006 at 11:16:59AM +0100, Martin Michlmayr wrote: > * Gordon Farquharson <gordonfarquharson@gmail.com> [2006-12-27 22:38]: > > >> #define TARGETSIZE (100 << 12) > > > > > >That's just 400kB! > > > > > >There's no way you should see corruption with that kind of value. It > > >should all stay solidly in the cache. > > > > > >Is this perhaps with ARM nommu or something else strange? It may be that > > >the program just doesn't work at all if mmap() is faked out with a malloc > > >or similar. > > > > Definitely a question for the ARM gurus. I'm out of my depth. > > By the way, I just tried it with TARGETSIZE (100 << 12) on a different > ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of > memory) and I see similar results to that from Gordon: Work around the glibc memset() problem by passing nr & 255, and re-run the test. You're getting false positives. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 10:49 ` Russell King @ 2006-12-28 14:56 ` Martin Michlmayr 0 siblings, 0 replies; 311+ messages in thread From: Martin Michlmayr @ 2006-12-28 14:56 UTC (permalink / raw) To: Gordon Farquharson, Linus Torvalds, David Miller, ranma, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List * Russell King <rmk+lkml@arm.linux.org.uk> [2006-12-28 10:49]: > > By the way, I just tried it with TARGETSIZE (100 << 12) on a different > > ARM machine (a Thecus N2100 based on an IOP32x chip with 128 MB of > > memory) and I see similar results to that from Gordon: > > Work around the glibc memset() problem by passing nr & 255, and re-run > the test. You're getting false positives. Yeah, I saw your message about this problem after I sent mine. -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one [not found] ` <Pine.LNX.4.64.0612272120180.4473@woody.osdl.org> 2006-12-28 5:38 ` Gordon Farquharson @ 2006-12-28 5:58 ` Gordon Farquharson 2006-12-28 17:08 ` Linus Torvalds 1 sibling, 1 reply; 311+ messages in thread From: Gordon Farquharson @ 2006-12-28 5:58 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List On 12/27/06, Linus Torvalds <torvalds@osdl.org> wrote: > That's just 400kB! > > There's no way you should see corruption with that kind of value. It > should all stay solidly in the cache. 100kB and 200kB files always succeed on the ARM system. 400kB and larger always seem to fail. Does the following help interpret the results on ARM at all ?

$ free
             total       used       free     shared    buffers     cached
Mem:         30000      23620       6380          0        808      15676
-/+ buffers/cache:       7136      22864
Swap:        88316       3664      84652

Gordon -- Gordon Farquharson ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 5:58 ` Gordon Farquharson @ 2006-12-28 17:08 ` Linus Torvalds 0 siblings, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-28 17:08 UTC (permalink / raw) To: Gordon Farquharson Cc: David Miller, ranma, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List On Wed, 27 Dec 2006, Gordon Farquharson wrote: > > 100kB and 200kB files always succeed on the ARM system. 400kB and > larger always seem to fail. Oh, wow. Yeah, I've just repressed how tiny 32MB is. And especially if you lowered the /proc/sys/vm/dirty_ratio to a smaller percentage, I guess 400kB should be enough to cause writeback. Ugh. I tested a 128MB machine a few weeks ago, and found it painful. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* RE: [PATCH] mm: fix page_mkclean_one 2006-12-28 3:04 ` Linus Torvalds 2006-12-28 4:32 ` Gordon Farquharson @ 2006-12-28 5:55 ` Chen, Kenneth W 2006-12-28 6:10 ` Chen, Kenneth W 2006-12-28 9:15 ` Zhang, Yanmin ` (2 subsequent siblings) 4 siblings, 1 reply; 311+ messages in thread From: Chen, Kenneth W @ 2006-12-28 5:55 UTC (permalink / raw) To: 'Linus Torvalds', David Miller Cc: ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM > On Wed, 27 Dec 2006, David Miller wrote: > > > > > > I still don't see _why_, though. But maybe smarter people than me can see > > > it.. > > > > FWIW this program definitely triggers the bug for me. > > Ok, now that I have something simple to do repeatable stuff with, I can > say what the pattern is.. It's not all that surprising, but it's still > worth just stating for the record. Running the test code, git bisect points its finger at this commit. Reverting this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code. edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f Author: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Mon Sep 25 23:30:58 2006 -0700 [PATCH] mm: balance dirty pages Now that we can detect writers of shared mappings, throttle them. Avoids OOM by surprise. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org> ^ permalink raw reply [flat|nested] 311+ messages in thread
* RE: [PATCH] mm: fix page_mkclean_one 2006-12-28 5:55 ` Chen, Kenneth W @ 2006-12-28 6:10 ` Chen, Kenneth W 2006-12-28 6:27 ` David Miller 2006-12-28 17:10 ` Linus Torvalds 0 siblings, 2 replies; 311+ messages in thread From: Chen, Kenneth W @ 2006-12-28 6:10 UTC (permalink / raw) To: 'Linus Torvalds', David Miller Cc: ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List Chen, Kenneth wrote on Wednesday, December 27, 2006 9:55 PM > Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM > > On Wed, 27 Dec 2006, David Miller wrote: > > > > > > > > I still don't see _why_, though. But maybe smarter people than me can see > > > > it.. > > > > > > FWIW this program definitely triggers the bug for me. > > > > Ok, now that I have something simple to do repeatable stuff with, I can > > say what the pattern is.. It's not all that surprising, but it's still > > worth just stating for the record. > > > Running the test code, git bisect points its finger at this commit. Reverting > this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code. > > edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit > commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f > Author: Peter Zijlstra <a.p.zijlstra@chello.nl> > Date: Mon Sep 25 23:30:58 2006 -0700 > > [PATCH] mm: balance dirty pages > > Now that we can detect writers of shared mappings, throttle them. Avoids OOM > by surprise. Oh, never mind :-( I just didn't create enough write out pressure when test this. I just saw bug got triggered on a kernel I previously thought was OK. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 6:10 ` Chen, Kenneth W @ 2006-12-28 6:27 ` David Miller 2006-12-28 17:10 ` Linus Torvalds 1 sibling, 0 replies; 311+ messages in thread From: David Miller @ 2006-12-28 6:27 UTC (permalink / raw) To: kenneth.w.chen Cc: torvalds, ranma, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa, akpm, hugh, nickpiggin, arjan, linux-kernel From: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Date: Wed, 27 Dec 2006 22:10:52 -0800 > Chen, Kenneth wrote on Wednesday, December 27, 2006 9:55 PM > > Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM > > > On Wed, 27 Dec 2006, David Miller wrote: > > > > > > > > > > I still don't see _why_, though. But maybe smarter people than me can see > > > > > it.. > > > > > > > > FWIW this program definitely triggers the bug for me. > > > > > > Ok, now that I have something simple to do repeatable stuff with, I can > > > say what the pattern is.. It's not all that surprising, but it's still > > > worth just stating for the record. > > > > > > Running the test code, git bisect points its finger at this commit. Reverting > > this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code. > > > > edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit > > commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f > > Author: Peter Zijlstra <a.p.zijlstra@chello.nl> > > Date: Mon Sep 25 23:30:58 2006 -0700 > > > > [PATCH] mm: balance dirty pages > > > > Now that we can detect writers of shared mappings, throttle them. Avoids OOM > > by surprise. > > > Oh, never mind :-( I just didn't create enough write out pressure when > test this. I just saw bug got triggered on a kernel I previously thought > was OK. Besides, I'm pretty sure that from the Debian bug entry it's been established that the dirty-page tracking changes from a few releases ago introduced this problem. ^ permalink raw reply [flat|nested] 311+ messages in thread
* RE: [PATCH] mm: fix page_mkclean_one 2006-12-28 6:10 ` Chen, Kenneth W 2006-12-28 6:27 ` David Miller @ 2006-12-28 17:10 ` Linus Torvalds 1 sibling, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-28 17:10 UTC (permalink / raw) To: Chen, Kenneth W Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List On Wed, 27 Dec 2006, Chen, Kenneth W wrote: > > > > Running the test code, git bisect points its finger at this commit. Reverting > > this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code. > > > > [PATCH] mm: balance dirty pages > > > > Now that we can detect writers of shared mappings, throttle them. Avoids OOM > > by surprise. > > Oh, never mind :-( I just didn't create enough write out pressure when > test this. I just saw bug got triggered on a kernel I previously thought > was OK. Btw, this is an important point - people have long felt that the new page balancing in 2.6.19 was to blame, but you've just confirmed the long-held suspicion (at least by me) that it's not actually a new bug at all, it's just that the dirty page balancing causes writeback to happen _earlier_, and thus is better able to _show_ a bug that we've likely had for a long long time. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 3:04 ` Linus Torvalds 2006-12-28 4:32 ` Gordon Farquharson 2006-12-28 5:55 ` Chen, Kenneth W @ 2006-12-28 9:15 ` Zhang, Yanmin 2006-12-28 17:15 ` Linus Torvalds 2006-12-28 11:50 ` Petri Kaukasoina 2006-12-28 15:09 ` Guillaume Chazarain 4 siblings, 1 reply; 311+ messages in thread From: Zhang, Yanmin @ 2006-12-28 9:15 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List On Wed, 2006-12-27 at 19:04 -0800, Linus Torvalds wrote: > > On Wed, 27 Dec 2006, David Miller wrote: > > > > > > I still don't see _why_, though. But maybe smarter people than me can see > > > it.. > > > > FWIW this program definitely triggers the bug for me. > > Ok, now that I have something simple to do repeatable stuff with, I can > say what the pattern is.. It's not all that surprising, but it's still > worth just stating for the record. > > What happens is that when I do the "packetized writes" in random order, > the _last_ write to a page occasionally just goes missing. It's not always > at the end of a page, as shown by for example: > > - A whole chunk got dropped: > > Chunk 2094 corrupted (0-1459) (1624-3083) > Expected 46, got 0 > Written as (30912)55414(10000) > > That "Written as (x)y(z)" line means that the corrupted chunk was > written as chunk #y, and the preceding and following chunks (that were > _not_ corrupt) on the page was written as #x and #z respectively. > > In other words, the missing chunk (which is still zero) was written > much later than the ones that were ok, and never hit the disk. It's a > contiguous chunk in the middle of the page (chunks are 1460 bytes in > size) > > The first line means that all bytes of the chunk (0-1459) were > corrupted, and the values in parenthesis are the offsets within a page. > In other words, this was a chunk in the _middle_ of a page. 
> > - The missing data can also be at the beginning or ends of pages: > > Beginning of the chunk missing, it was at the end of a page (page > offsets 3288-4095) and the _next_ page got written out fine: > > Chunk 2126 corrupted (0-807) (3288-4095) > Expected 78, got 0 > Written as (32713)55573(14301) > > End of a chunk missing, it was the beginning of a page (and the > _previous_ page that contained the beginning of the chunk was written > out fine) > > Chunk 2179 corrupted (1252-1459) (0-207) > Expected 131, got 0 > Written as (45189)55489(15515) > > Now, the reason I say this isn't surprising is that this is entirely > consistent with the dirty bit being dropped on the floor somewhere, and > likely through some interaction with the previous changes being in the > process of being written out. > > Something (incorrectly) ends up deciding that it doesn't need to write the > page, since it's already written, or alternatively clears the dirty bit > too late (clears it because an _earlier_ write finished, never mind that > the new dirty data didn't make it). There might be a narrow race between set_page_dirty and clear_page_dirty. The test program is a process to write/read data. pdflush might write data to disk asynchronously. After pdflush writes a page to disk, it will call (either by softirq) clear_page_dirty to clear the dirty bit after getting the interrupt notification. But just after the page is written to disk and before the interrupt reports the result, the test process might change the data and unmap the area. When the area is unmapped, the page is marked as dirty again, but just after that, the interrupt arrives and the dirty bit is cleared, so the late data will not be written to disk. Function zap_pte_range checks pte to set page dirty if needed, but it doesn't hold page lock. If the page lock is held before set page dirty, the race might be avoided. Yanmin ^ permalink raw reply [flat|nested] 311+ messages in thread
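The page offsets quoted in the reports above follow directly from the chunk size. A minimal sketch of that arithmetic, assuming 4096-byte pages and the 1460-byte chunks of the test program (the helper names span_of_chunk and check_reported_chunks are made up for illustration):

```c
#include <assert.h>

#define CHUNKSIZE 1460UL	/* chunk size used by the test program */
#define PAGESIZE  4096UL	/* assumed page size */

struct chunk_span {
	unsigned long first_page, first_off;	/* page and offset of the first byte */
	unsigned long last_page, last_off;	/* page and offset of the last byte */
};

/* Where does chunk number 'chunk' start and end within the mapped file? */
static struct chunk_span span_of_chunk(unsigned long chunk)
{
	unsigned long start = chunk * CHUNKSIZE;
	unsigned long end = start + CHUNKSIZE - 1;
	struct chunk_span s = {
		start / PAGESIZE, start % PAGESIZE,
		end / PAGESIZE, end % PAGESIZE
	};
	return s;
}

/* Re-derive the offsets quoted for chunks 2094, 2126 and 2179. */
static void check_reported_chunks(void)
{
	struct chunk_span s;

	s = span_of_chunk(2094);	/* whole chunk inside one page */
	assert(s.first_page == s.last_page);
	assert(s.first_off == 1624 && s.last_off == 3083);

	s = span_of_chunk(2126);	/* first 808 bytes end a page */
	assert(s.last_page == s.first_page + 1);
	assert(s.first_off == 3288);

	s = span_of_chunk(2179);	/* last 208 bytes begin the next page */
	assert(s.last_page == s.first_page + 1);
	assert(s.last_off == 207);
}
```

In each reported case the corrupted byte range is exactly the part of the chunk that falls within a single page, which fits the reading that a whole-page writeback went missing.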
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 9:15 ` Zhang, Yanmin @ 2006-12-28 17:15 ` Linus Torvalds 0 siblings, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-28 17:15 UTC (permalink / raw) To: Zhang, Yanmin Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List On Thu, 28 Dec 2006, Zhang, Yanmin wrote: > > The test program is a process to write/read data. pdflush might write data > to disk asynchronously. After pdflush writes a page to disk, it will call (either by > softirq) clear_page_dirty to clear the dirty bit after getting the interrupt > notification. That would indeed be a horrible bug. However, we don't do "clear_page_dirty()" _after_ the IO has completed, we do it _before_ the IO starts. If you can actually find a place that does clear_page_dirty _after_ IO, then yes, you've just found the bug. But I haven't found it. So the rule is _always_: - call "clear_page_dirty_for_io()" with the page lock held, and _before_ the IO starts. - do "set_page_writeback()" before unlocking the page again - do a "end_page_writeback()" when the IO actually finishes. and any code sequence that doesn't honor those rules would be buggy. A beer for anybody that finds it.. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
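Why the ordering in that rule matters can be shown with a user-space toy model. This is only a sketch: the toy_* names are invented here, none of this is kernel code, and the page lock is left out entirely. It compares clearing the dirty flag before the IO starts with clearing it when the IO completes, with one store arriving while the IO is in flight.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page: one word of data, plus what has reached the "disk". */
struct toy_page {
	bool dirty;
	bool writeback;
	int mem;	/* current contents in memory */
	int in_flight;	/* snapshot handed to the device */
	int disk;	/* contents on disk */
};

/* A store to the page always (re)marks it dirty. */
static void toy_store(struct toy_page *p, int v)
{
	p->mem = v;
	p->dirty = true;
}

/* Start writeback if the page is dirty and not already under IO. */
static bool toy_start_io(struct toy_page *p, bool clear_before)
{
	if (p->writeback || !p->dirty)
		return false;
	if (clear_before)
		p->dirty = false;	/* the ordering the rule asks for */
	p->writeback = true;
	p->in_flight = p->mem;
	return true;
}

/* IO completion. */
static void toy_end_io(struct toy_page *p, bool clear_before)
{
	p->disk = p->in_flight;
	if (!clear_before)
		p->dirty = false;	/* buggy ordering: wipes later stores */
	p->writeback = false;
}

/* Store 1, start IO, store 2 during the IO, finish, then flush again. */
static int toy_run(bool clear_before)
{
	struct toy_page p = {0};

	toy_store(&p, 1);
	toy_start_io(&p, clear_before);
	toy_store(&p, 2);
	toy_end_io(&p, clear_before);
	if (toy_start_io(&p, clear_before))
		toy_end_io(&p, clear_before);
	return p.disk;
}
```

With clear_before set, toy_run() ends with the second store on disk; with it unset, the second store is never written, which is the kind of loss Yanmin's scenario would produce if such a code path existed.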
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 3:04 ` Linus Torvalds ` (2 preceding siblings ...) 2006-12-28 9:15 ` Zhang, Yanmin @ 2006-12-28 11:50 ` Petri Kaukasoina 2006-12-28 15:09 ` Guillaume Chazarain 4 siblings, 0 replies; 311+ messages in thread From: Petri Kaukasoina @ 2006-12-28 11:50 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List On Wed, Dec 27, 2006 at 07:04:34PM -0800, Linus Torvalds wrote: > [ Modified test-program that tells you where the corruption happens (and > when the missing parts were supposed to be written out) appended, in > case people care. ] Hi 2.6.18 (and 2.6.18.6) is ok, 2.6.19-rc1 is broken. I tried some snapshots between them but they hung before shell (2.6.18-git11, 2.6.18-git16, 2.6.18-git20, 2.6.18-git21). 2.6.18-git22 booted and was broken. (UP, no preempt) -Petri ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 3:04 ` Linus Torvalds ` (3 preceding siblings ...) 2006-12-28 11:50 ` Petri Kaukasoina @ 2006-12-28 15:09 ` Guillaume Chazarain 2006-12-28 19:19 ` Guillaume Chazarain 4 siblings, 1 reply; 311+ messages in thread From: Guillaume Chazarain @ 2006-12-28 15:09 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List I set a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/ I have corruption with every Fedora release kernel except the first, that is 2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit some corruption. Command line to test: qemu root_fs -snapshot -kernel FC-kernels/FC2-vmlinuz-2.6.5-1.358 -append 'rw root=/dev/hda' I get this kind of corruption: http://guichaz.free.fr/linux-bug/corruption.png -- Guillaume ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 15:09 ` Guillaume Chazarain @ 2006-12-28 19:19 ` Guillaume Chazarain 2006-12-28 19:28 ` Linus Torvalds 0 siblings, 1 reply; 311+ messages in thread From: Guillaume Chazarain @ 2006-12-28 19:19 UTC (permalink / raw) To: Linus Torvalds Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List, Chen Kenneth W

[-- Attachment #1: Type: text/plain, Size: 625 bytes --]

Guillaume Chazarain wrote:
> I get this kind of corruption:
> http://guichaz.free.fr/linux-bug/corruption.png

Actually in qemu, I get three different behaviours:
- no corruption at all: with linux-2.4
- corruption only on the first chunks: before [PATCH] mm: balance dirty pages as identified by Kenneth
- corruption of all chunks: after the balance dirty pages patch

Bisecting in linux-2.5 land I found http://kernel.org/pub/linux/kernel/people/akpm/patches/2.5/2.5.66/2.5.66-mm3/broken-out/fadvise-flush-data.patch to cause the corruption for me.

The attached patch fixes the corruption for me.

-- 
Guillaume

[-- Attachment #2: fadvise-dontneed.patch --]
[-- Type: text/x-patch, Size: 492 bytes --]

diff -r 3859b1144d3a mm/fadvise.c
--- a/mm/fadvise.c	Sun Dec 24 05:00:03 2006 +0000
+++ b/mm/fadvise.c	Thu Dec 28 19:53:40 2006 +0100
@@ -96,9 +96,6 @@ asmlinkage long sys_fadvise64_64(int fd,
 	case POSIX_FADV_NOREUSE:
 		break;
 	case POSIX_FADV_DONTNEED:
-		if (!bdi_write_congested(mapping->backing_dev_info))
-			filemap_flush(mapping);
-
 		/* First and last FULL page! */
 		start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT;
 		end_index = (endbyte >> PAGE_CACHE_SHIFT);

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 19:19 ` Guillaume Chazarain @ 2006-12-28 19:28 ` Linus Torvalds 2006-12-28 19:45 ` Andrew Morton 0 siblings, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-28 19:28 UTC (permalink / raw) To: Guillaume Chazarain Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa, Andrew Morton, hugh, nickpiggin, arjan, Linux Kernel Mailing List, Chen Kenneth W On Thu, 28 Dec 2006, Guillaume Chazarain wrote: > > The attached patch fixes the corruption for me. Well, that's a good hint, but it's really just a symptom. You effectively just made the test-program not even try to flush the data to disk, so the page cache would stay in memory, and you'd not see the corruption as well. So you basically disabled the code that tried to trigger the bug more easily. But the reason I say it's interesting is that "WB_SYNC_NONE" is very much implicated in mm/page-writeback.c, and if there is a bug triggered by WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also fails.. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 19:28 ` Linus Torvalds @ 2006-12-28 19:45 ` Andrew Morton 2006-12-28 20:14 ` Linus Torvalds 2006-12-28 22:35 ` [PATCH] mm: fix page_mkclean_one Mike Galbraith 0 siblings, 2 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-28 19:45 UTC (permalink / raw) To: Linus Torvalds Cc: Guillaume Chazarain, David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa, hugh, nickpiggin, arjan, Linux Kernel Mailing List, Chen Kenneth W On Thu, 28 Dec 2006 11:28:52 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > > > On Thu, 28 Dec 2006, Guillaume Chazarain wrote: > > > > The attached patch fixes the corruption for me. > > Well, that's a good hint, but it's really just a symptom. You effectively > just made the test-program not even try to flush the data to disk, so the > page cache would stay in memory, and you'd not see the corruption as well. > > So you basically disabled the code that tried to trigger the bug more > easily. > > But the reason I say it's interesting is that "WB_SYNC_NONE" is very much > implicated in mm/page-writeback.c, and if there is a bug triggered by > WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also > fails.. > It would be interesting to convert your app to do fsync() before FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well (apart from pdflush activity). ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 19:45 ` Andrew Morton @ 2006-12-28 20:14 ` Linus Torvalds 2006-12-28 22:38 ` David Miller 2006-12-28 22:35 ` [PATCH] mm: fix page_mkclean_one Mike Galbraith 1 sibling, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-28 20:14 UTC (permalink / raw) To: Andrew Morton Cc: Guillaume Chazarain, David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa, hugh, nickpiggin, arjan, Linux Kernel Mailing List, Chen Kenneth W

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2468 bytes --]

On Thu, 28 Dec 2006, Andrew Morton wrote:
>
> It would be interesting to convert your app to do fsync() before
> FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well
> (apart from pdflush activity).

I get corruption - but the whole point is that it's very much pdflush that should be writing these pages out.

Andrew - give my test-program a try. It can run in about 1 minute if you have a 256MB machine (I didn't, but "mem=256M" is my friend), and it seems to very consistently cause corruption.

What I do is:

	# Make sure we write back aggressively
	echo 5 > /proc/sys/vm/dirty_ratio

as root, and then just run the thing. Tons of corruption. But the corruption goes away if I just leave the default dirty ratio alone (but then I can increase the file size to trigger it, of course - but that also makes the test run a lot slower).

Now, with a pre-2.6.19 kernel, I bet you won't get the corruption as easily (at least with the "fsync()"), but that's less to do with anything new, and probably just because then you simply won't have any pdflushing going on - since the kernel won't even notice that you have tons of dirty pages ;)

It might also depend on the speed of your disk drive - the machine I test this on has a slow 4200 rpm laptop drive in it, and that probably makes things go south more easily. That's _especially_ true if this is related to any "bdi_write_congested()" logic.
Now, it could also be related to various code snippets like

	...
	if (wbc->sync_mode != WB_SYNC_NONE)
		wait_on_page_writeback(page);

	if (PageWriteback(page) ||
			!clear_page_dirty_for_io(page)) {
		unlock_page(page);
		continue;
	}
	...

where the WB_SYNC_NONE case will hit the "PageWriteback()" and just not do the writeback at all (but it also won't clear the dirty bit, so it's certainly not an *OBVIOUS* bug).

We also have code like this ("pageout()"):

	if (clear_page_dirty_for_io(page)) {
		int res;
		struct writeback_control wbc = {
			.sync_mode = WB_SYNC_NONE,
			..
		}
		...
		res = mapping->a_ops->writepage(page, &wbc);

and in this case, if the "WB_SYNC_NONE" means that the "writepage()" call won't do anything at all because of congestion, then that would be a _bad_ thing, and would certainly explain how something didn't get written out.

But that particular path should only trigger for the "shrink_page_list()" case, and it's not the case I seem to be testing with my "low dirty_ratio" testing.

		Linus

[-- Attachment #2: Type: TEXT/PLAIN, Size: 2872 bytes --]

#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (22 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}

#define page_offset(buf, off) (unsigned)((unsigned long)(buf)+(off)-(unsigned long)(mapping))

static int chunkorder[NRCHUNKS];
static char *mapping;

static int order(int nr)
{
	int i;
	if (nr < 0 || nr >= NRCHUNKS)
		return -1;
	for (i = 0; i < NRCHUNKS; i++)
		if (chunkorder[i] == nr)
			return i;
	return -2;
}

static void checkmem(void *buf, int nr)
{
	unsigned int start = ~0u, end = 0;
	unsigned char c = nr, *p = buf, differs = 0;
	int i;
	for (i = 0; i < CHUNKSIZE; i++) {
		unsigned char got = *p++;
		if (got != c) {
			if (i < start)
				start = i;
			if (i > end)
				end = i;
			differs = got;
		}
	}
	if (start < end) {
		printf("Chunk %d corrupted (%u-%u) (%x-%x) \n", nr, start, end,
			page_offset(buf, start), page_offset(buf, end));
		printf("Expected %u, got %u\n", c, differs);
		printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), order(nr+1));
	}
}

static char *remap(int fd, char *mapping)
{
	if (mapping) {
		munmap(mapping, SIZE);
		// fsync(fd);
		posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
	}
	return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(int argc, char **argv)
{
	int fd, i;

	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;

	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = remap(fd, NULL);
	if (-1 == (int)(long)mapping)
		return -1;

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Unmap, drop, and remap.. */
	mapping = remap(fd, mapping);

	/* .. and check */
	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Clean up for next time */
	sleep(5);
	sync();
	sleep(5);
	munmap(mapping, SIZE);
	close(fd);
	unlink("mapfile");
	return 0;
}

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 20:14 ` Linus Torvalds @ 2006-12-28 22:38 ` David Miller 2006-12-29 2:50 ` Segher Boessenkool 0 siblings, 1 reply; 311+ messages in thread From: David Miller @ 2006-12-28 22:38 UTC (permalink / raw) To: torvalds Cc: akpm, guichaz, ranma, gordonfarquharson, tbm, a.p.zijlstra, andrei.popa, hugh, nickpiggin, arjan, linux-kernel, kenneth.w.chen From: Linus Torvalds <torvalds@osdl.org> Date: Thu, 28 Dec 2006 12:14:31 -0800 (PST) > I get corruption - but the whole point is that it's very much pdflush that > should be writing these pages out. I think what might be happening is that pdflush writes them out fine, however we don't trap writes by the application _during_ that writeout. These corruptions look exactly as if: 1) pdflush begins writeback of page X 2) page goes to disk 3) application writes a chunk to the page 4) pdflush et al. think the page is clean, so it gets tossed, losing the writes done in #3 So there's a missing PTE change in there, so that we never get proper re-dirtying of the page if the application tries to write to the page during the writeback. It's something that will only occur with writeback and MAP_SHARED writable access to the file pages. That's why we never see this with normal filesystem writes, since those explicitly manage the page dirty state. I think the dirty balancing logic etc. isn't where the problems are, to me it's a PTE state update issue for sure. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 22:38 ` David Miller @ 2006-12-29 2:50 ` Segher Boessenkool 2006-12-29 6:48 ` Linus Torvalds 0 siblings, 1 reply; 311+ messages in thread From: Segher Boessenkool @ 2006-12-29 2:50 UTC (permalink / raw) To: David Miller Cc: nickpiggin, kenneth.w.chen, guichaz, hugh, linux-kernel, ranma, torvalds, gordonfarquharson, akpm, a.p.zijlstra, tbm, arjan, andrei.popa > I think what might be happening is that pdflush writes them out fine, > however we don't trap writes by the application _during_ that writeout. Yeah. I believe that more exactly it happens if the very last write to the page causes a writeback (due to dirty balancing) while another writeback for the page is already happening. As usual in these cases, I have zero proof. > It's something that will only occur with writeback and MAP_SHARED > writable access to the file pages. It's the do_wp_page -> balance_dirty_pages -> generic_writepages path for sure. Maybe it's enough to change if (wbc->sync_mode != WB_SYNC_NONE) wait_on_page_writeback(page); if (PageWriteback(page) || !clear_page_dirty_for_io(page)) { unlock_page(page); continue; } to if (wbc->sync_mode != WB_SYNC_NONE) wait_on_page_writeback(page); if (PageWriteback(page)) { redirty_page_for_writepage(wbc, page); unlock_page(page); continue; } if (!clear_page_dirty_for_io(page)) { unlock_page(page); continue; } or something along those lines. Completely untested of course :-) Segher ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-29 2:50 ` Segher Boessenkool @ 2006-12-29 6:48 ` Linus Torvalds 2006-12-29 8:58 ` Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) Linus Torvalds 2006-12-29 12:19 ` [patch] fix data corruption bug in __block_write_full_page() Ingo Molnar 0 siblings, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-29 6:48 UTC (permalink / raw) To: Segher Boessenkool Cc: David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, linux-kernel, ranma, gordonfarquharson, akpm, a.p.zijlstra, tbm, arjan, andrei.popa On Fri, 29 Dec 2006, Segher Boessenkool wrote: > > > I think what might be happening is that pdflush writes them out fine, > > however we don't trap writes by the application _during_ that writeout. > > Yeah. I believe that more exactly it happens if the very last > write to the page causes a writeback (due to dirty balancing) > while another writeback for the page is already happening. > > As usual in these cases, I have zero proof. I actually have proof to the contrary, ie I have traces that say "the write was started" after the last write. And the VM layer in this area is actually fairly sane and civilized. It has a bit that says "writeback in progress", and if that bit is set, it simply _will_not_ start a new write. It even has various BUG_ON()'s to that effect. So everything I have ever seen says that the VM layer is actually doing everything right. > It's the do_wp_page -> balance_dirty_pages -> generic_writepages > path for sure. Maybe it's enough to change > > if (wbc->sync_mode != WB_SYNC_NONE) > wait_on_page_writeback(page); > > if (PageWriteback(page) || > !clear_page_dirty_for_io(page)) { > unlock_page(page); > continue; > } Notice how this one basically says: - if it's under writeback, don't even clear the page dirty flag.
Your suggested change: > if (wbc->sync_mode != WB_SYNC_NONE) > wait_on_page_writeback(page); > > if (PageWriteback(page)) { > redirty_page_for_writepage(wbc, page); makes no sense, because we simply never _did_ the "clear_page_dirty()" if the thing was under writeback in the first place. That's how C conditionals work. So there's no reason to "redirty" it, because it wasn't cleaned in the first place. I've double- and triple-checked the dirty bits, including having traces that actually say that the IO was started (from a VM perspective) _after_ the last write was done. The IO just didn't hit the disk. I'm personally fairly convinced that it's not a VM issue, but a "IO issue". Either in a low-level filesystem or in some of the fs/buffer.c helper routines. But I'd love to be proven wrong. I do have a few interesting details from the trace I haven't really analyzed yet. Here's the trace for events on one of the pages that was corrupted. Note how the events are numbered (there were 171640 events total), so the thing you see is just a small set of events from the whole big trace, but it's the ones that talk about _that_ particular page. I've grouped them so that "consecutive" events group together. That just means that no events on any other pages happened in between those events, and it is usually a sign that it's really one single call-chain that causes all the events. For example, for the first group of three events (44366-44368), it's the page fault that brings in the page, and since it's a write-fault, it will not only map the page, it will mark the page itself dirty and then also set the TAG_DIRTY on the mapping. So the "group" is just really a result of one single event happening, which causes several things to happen to that page. That's exactly what you'd expect.
Anyway, here is the list of events that page went through:

	44366 PG 00000f6d: mm/memory.c:2254 mapping at b789fc54 (write)
	44367 PG 00000f6d: mm/page-writeback.c:817 setting dirty
	44368 PG 00000f6d: fs/buffer.c:738 setting TAG_DIRTY

	64231 PG 00000f6d: mm/page-writeback.c:872 clean_for_io
	64232 PG 00000f6d: mm/rmap.c:451 cleaning PTE b789f000
	64233 PG 00000f6d: mm/page-writeback.c:914 set writeback
	64234 PG 00000f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK
	64235 PG 00000f6d: mm/page-writeback.c:922 clearing TAG_DIRTY

	67570 PG 00000f6d: mm/page-writeback.c:891 end writeback
	67571 PG 00000f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK

	76705 PG 00000f6d: mm/page-writeback.c:817 setting dirty
	76706 PG 00000f6d: fs/buffer.c:725 dirtied buffers
	76707 PG 00000f6d: fs/buffer.c:738 setting TAG_DIRTY

	105267 PG 00000f6d: mm/page-writeback.c:872 clean_for_io
	105268 PG 00000f6d: mm/rmap.c:451 cleaning PTE b789f000
	105269 PG 00000f6d: mm/page-writeback.c:914 set writeback
	105270 PG 00000f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK
	105271 PG 00000f6d: mm/page-writeback.c:922 clearing TAG_DIRTY
	105272 PG 00000f6d: mm/page-writeback.c:891 end writeback
	105273 PG 00000f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK

	128032 PG 00000f6d: mm/memory.c:670 unmapped at b789f000

	132662 PG 00000f6d: mm/filemap.c:119 Removing page cache

	148278 PG 00000f6d: mm/memory.c:2254 mapping at b789f000 (read)

	166326 PG 00000f6d: mm/memory.c:670 unmapped at b789f000

	171958 PG 00000f6d: mm/filemap.c:119 Removing page cache

And notice that big grouping of seven events (105267-105273). The five first events really _do_ make sense together: it's our page cleaning that happens. But notice how the "end writeback" happens _immediately_.
Here's another page cleaning event for the page that preceded that page, and did _not_ get corrupted:

	105262 PG 00000f6c: mm/page-writeback.c:872 clean_for_io
	105263 PG 00000f6c: mm/rmap.c:451 cleaning PTE b789e000
	105264 PG 00000f6c: mm/page-writeback.c:914 set writeback
	105265 PG 00000f6c: mm/page-writeback.c:916 setting TAG_WRITEBACK
	105266 PG 00000f6c: mm/page-writeback.c:922 clearing TAG_DIRTY

	108437 PG 00000f6c: mm/page-writeback.c:891 end writeback
	108438 PG 00000f6c: mm/page-writeback.c:893 clearing TAG_WRITEBACK

and this looks a lot more like what you'd expect: other things happened in between the "clear dirty, set writeback" stage and the "end writeback" stage. That's what you'd expect to see if there was actually overlapping IO and/or work.

(And notice that that _was_ what you saw even for the corrupted page for the _first_ writeback: you saw the group-of-five that indicated a page cleaning event had started, and then a group-of-two to indicate that the writeback finished).

So I find this kind of pattern really suspicious. We have a missing writeout, and my traces show (I didn't analyze this _particular_ one closely, but I did the previous trace for another page that I posted) that the writeback was actually started after the write that went missing was done. AND I have this trace that seems to show that the writeback basically completed immediately, with no other work in between.

That to me says: "somebody didn't actually write it out". The VM layer asked the filesystem to do the write, but the filesystem just didn't do it. I personally think it's because some buffer-head BH_dirty bit got scrogged, but it could be some event that makes the filesystem simply not do the IO because it thinks the "disk queues are too full", so it just says "IO completed", without actually doing anything at all.
Now, the fact that it apparently happens for all of ext2, ext3 and reiserfs (but NOT apparently with "data=writeback"), makes me suspect that there is some common interaction, and that it's somehow BH-related (they all share much of the buffer head infrastructure). So it doesn't look like it's just a bug in one random filesystem, I think it's a bug in some buffer-head infrastructure/helper function. So I don't think it's "core VM". I don't think it's the "page cache". I think we handle the dirty state correctly at that level. It looks more like "buffer cache" or "filesystem" to me by now. (Btw, don't get me wrong - the above sequence numbers are in no way *proof* of anything. You could get big groups for one page just because something ended up being synchronous. I'll add some timestamps to my traces to make it easier to see where there was real IO going on and where there wasn't). Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
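The grouping used in the traces above (runs of consecutive event numbers for one page) is easy to reproduce mechanically. A minimal sketch, assuming the event numbers for a page have already been extracted into a sorted array; group_runs is a made-up helper, not part of any tracing tool:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Split a sorted list of event numbers into runs of consecutive values.
 * The length of each run is stored in 'out' (which must have room for
 * 'n' entries); the number of runs is returned.
 */
static size_t group_runs(const unsigned *ev, size_t n, size_t *out)
{
	size_t runs = 0;

	assert(ev != NULL && out != NULL);
	for (size_t i = 0; i < n; i++) {
		/* A gap in the numbering starts a new run. */
		if (i == 0 || ev[i] != ev[i - 1] + 1)
			out[runs++] = 0;
		out[runs - 1]++;
	}
	return runs;
}
```

Fed the first twenty events of page 00000f6d, this yields runs of 3, 5, 2, 3 and 7 events. The run of 7 is the suspicious one: "clean_for_io" and "end writeback" fall in the same run, with no other page's events in between.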
* Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 6:48 ` Linus Torvalds @ 2006-12-29 8:58 ` Linus Torvalds 2006-12-29 10:48 ` Linus Torvalds 2006-12-29 15:27 ` Theodore Tso 2006-12-29 12:19 ` [patch] fix data corruption bug in __block_write_full_page() Ingo Molnar 1 sibling, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-29 8:58 UTC (permalink / raw) To: Segher Boessenkool Cc: David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, linux-kernel, ranma, gordonfarquharson, akpm, a.p.zijlstra, tbm, arjan, andrei.popa

On Thu, 28 Dec 2006, Linus Torvalds wrote:
>
> So everything I have ever seen says that the VM layer is actually doing
> everything right.

That was true, but at the same time, it's not. Let me explain.

> That to me says: "somebody didn't actually write it out". The VM layer
> asked the filesystem to do the write, but the filesystem just didn't do
> it. I personally think it's because some buffer-head BH_dirty bit got
> scrogged

Ok, I have proof of this now. Here's a trace (with cycle counts), and with a new trace event added: this is for another corrupted page. I have:

	49105 PG 000015d8 (14800): mm/page-writeback.c:872 clean_for_io
	49106 PG 000015d8 (6900): mm/rmap.c:451 cleaning PTE b7fa6000
	49107 PG 000015d8 (9900): mm/page-writeback.c:914 set writeback
	49108 PG 000015d8 (6480): mm/page-writeback.c:916 setting TAG_WRITEBACK
	49109 PG 000015d8 (7110): mm/page-writeback.c:922 clearing TAG_DIRTY
	49110 PG 000015d8 (7190): fs/buffer.c:1713 no IO underway
	49111 PG 000015d8 (6180): mm/page-writeback.c:891 end writeback
	49112 PG 000015d8 (6460): mm/page-writeback.c:893 clearing TAG_WRITEBACK

where that first column is the trace event number again, and the "PG 000015d8" is that corrupted page.
The thing in the parentheses is "CPU cycles since last event", and the important part to note is that this is indeed all one single thing with no actual IO anywhere (~7000 CPU cycles may sound like a lot, but (a) it's not that many cache misses and (b) a lot of it is the logging overhead - back-to-back log events will take about 3500 cycles just because it does the actual ASCII printk() etc). Also, the new event is: fs/buffer.c:1713 no IO underway which is just the "if (nr_underway == 0)" case in fs/buffer.c. And I now finally really believe that I fully understand the corruption, but I don't have a simple solution, much less a patch. What the problem basically boils down to is that "set_page_dirty()" is supposed to be a mark for dirtying THE WHOLE PAGE, but it really is not "the whole page when the 'set_page_dirty()' itself happens", but more of a "the next writepage() needs to write back the whole page" thing. And that's not what "__set_page_dirty_buffers()" really does. Because what "__set_page_dirty_buffers()" does is that AT THE TIME THE "set_page_dirty()" IS CALLED, it will mark all the buffers on that page as dirty. That may _sound_ like what we want, but it really isn't. Because by the time "writepage()" is actually called (which can be MUCH MUCH later), some internal filesystem activity may actually have cleaned one or more of those buffers in the meantime, and now we call "writepage()" (which really wants to write them _all_), and it will write only part of them, or none at all. So the VM thought that since it did a "writepage()", all the dirty state at that point got written back. But it didn't - the filesystem could have written back part or all of the page much earlier, and the writepage() actually does nothing at all. Both filesystem and VM actually _think_ they do the right thing, because they simply have totally different expectations.
The filesystem thinks that it should care about dirty buffers (that got marked dirty _after_ they were dirtied), while the VM thinks that it cares about dirty _pages_ (that got dirtied at any time _before_ "writepage()" was called). Neither is really "wrong", per se, it's just that the two parts have different expectations, and the _combination_ just doesn't work. "set_page_dirty()" at some point meant "the writes have been done", but these days it really means something else. Now, the reason there is no trivial patch is not that this is conceptually really hard to fix. I can see several different approaches to fixing it, but they all really boil down to two alternatives: (a) splitting the one "PG_dirty" bit up into two bits: the "PG_write_scheduled" bit and the "PG_alldirty" bit. The "PG_write_scheduled" bit would be the bit that the filesystem would set when it has pending dirty data that it wrote itself (and that may not cover the whole page), and is the part of PG_dirty that sets the PAGECACHE_TAG_DIRTY. It's also what forces "writepage()" to be called. The "PG_alldirty" bit is just an additional "somebody else dirtied random parts of this page, and we don't know what" flag, which is set by "set_page_dirty()" in addition to doing the PG_write_scheduled stuff. We would test-and-clear it at "writepage()" time, and pass it in to "writepage()" to tell the writepage() function that it can't just write out its own small limited notion of what is dirty. (There are various variations on this whole theme: instead of having a flag to "writepage()", we could split the "whole page" case out as a separate callback or similar) (b) making sure that all "set_page_dirty()" calls are _after_ the page has been marked dirty (which in the case of memory mapped pages would mean that we would _not_ call it when we mark the page writable at all, we would call it when we _remove_ the dirty bit and mark it unwritable).
That would have the nice feature that it wouldn't require any FS-level changes - it would basically make the VM dirty accounting work the way the FS layer now already expects it to. I think (b) is conceptually simpler, and I think I'll try it tomorrow after I've slept on it. The reason I mention (a) at all is that I like the conceptual notion of telling the filesystem ahead of time that "you're going to get a full dirty page", because what (b) will inevitably lead to is that the filesystem will maintain its own partial state, and then effectively just before it gets the writepage() notification, it will be told it was all pointless, because we just dirtied the whole thing. IOW, the advantage of (a) is also its disadvantage: it tells the filesystem more. The disadvantage is that it would require VFS interface changes exactly to do that (ie the "mapping->set_page_dirty()" semantics would also be split up into two, and it would now be a "prepare to write the whole page during the next 'writepage()'" thing). So to recap: the VM layer really expected "writepage()" to act as if it wrote out the whole page. It doesn't. Not in the presence of the buffer layer and the filesystem having written out some buffers independently of the VM layer earlier. I think this also explains why "data=ordered" broke, and "data=writeback" didn't. When ext3 does "ordered" writebacks, it will do file data writebacks on its own, in _its_ order. In contrast, when it does "data=writeback", it will do the writebacks exactly as the VM presents them, and won't write any buffers on its own - which makes the bug go away, because now VM and FS end up agreeing about the semantics of "writepage()". Andrew, do you see anything wrong in my thinking? Peter - on a VM level, the fix would be: - remove the "set_page_dirty()" from the page fault path, and just set the PAGECACHE_TAG_DIRTY instead.
- clear_page_dirty_for_io() would now need to check the mappings of the page even if it wasn't marked PG_dirty (or we'd have another page flag for the "page is dirty in page tables"), which is kind of a mixture of (a) and (b) cases above, except we don't expose it to the FS. - if it was dirty in the page tables, we do a "set_page_dirty()" after cleaning the page tables, and then the rest of "clear_page_dirty_for_io()" really boils down to a simple "TestAndClearDirty(page)" Hmm? I'd love it if somebody else wrote the patch and tested it, because I'm getting sick and tired of this bug ;) Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
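A minimal user-space sketch of the sequence described above, not kernel code. Every toy_* name is hypothetical, the page has a single pte, and each buffer holds one byte. It assumes that both the filesystem's own flush and writepage() write only the buffers that are still marked dirty. Under those assumptions the old ordering in clear_page_dirty_for_io() leaves a later mmap store off the disk, while re-dirtying the page after page_mkclean() gets it written.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_NBUF 4	/* buffers per page in this toy model */

/* Hypothetical stand-in for a page with buffer heads; not a kernel type. */
struct toy_page {
	bool pg_dirty;			/* the page "master dirty bit" */
	bool pte_dirty;			/* dirty bit of the single pte */
	bool bh_dirty[TOY_NBUF];	/* per-buffer dirty bits */
	char mem[TOY_NBUF];		/* in-memory contents, one byte per buffer */
	char disk[TOY_NBUF];		/* what has reached the disk */
};

/* Models __set_page_dirty_buffers(): dirty every buffer *now*, then the page. */
static void toy_set_page_dirty(struct toy_page *p)
{
	for (int i = 0; i < TOY_NBUF; i++)
		p->bh_dirty[i] = true;
	p->pg_dirty = true;
}

/* A store through a writable mapping touches only memory and the pte. */
static void toy_mmap_store(struct toy_page *p, int idx, char val)
{
	assert(idx >= 0 && idx < TOY_NBUF);
	p->mem[idx] = val;
	p->pte_dirty = true;
}

/* Writes only buffers still marked dirty (filesystem flush and writepage). */
static void toy_write_dirty_buffers(struct toy_page *p)
{
	for (int i = 0; i < TOY_NBUF; i++) {
		if (p->bh_dirty[i]) {
			p->disk[i] = p->mem[i];
			p->bh_dirty[i] = false;
		}
	}
}

/* Cleans the pte and reports whether it was dirty. */
static bool toy_page_mkclean(struct toy_page *p)
{
	bool was_dirty = p->pte_dirty;

	p->pte_dirty = false;
	return was_dirty;
}

static bool toy_test_clear_page_dirty(struct toy_page *p)
{
	bool was_dirty = p->pg_dirty;

	p->pg_dirty = false;
	return was_dirty;
}

/* Old ordering: the pte dirtiness is thrown away. */
static bool toy_clear_dirty_old(struct toy_page *p)
{
	if (!toy_test_clear_page_dirty(p))
		return false;
	toy_page_mkclean(p);
	return true;
}

/* Proposed ordering: re-dirty the whole page if the pte was dirty. */
static bool toy_clear_dirty_fixed(struct toy_page *p)
{
	if (toy_page_mkclean(p))
		toy_set_page_dirty(p);
	return toy_test_clear_page_dirty(p);
}

/* Returns true when the disk matches memory after the VM's writeback. */
static bool toy_scenario(bool use_fixed_ordering)
{
	struct toy_page p;
	bool need_io;

	memset(&p, 0, sizeof(p));
	memset(p.mem, 'a', TOY_NBUF);
	memset(p.disk, 'a', TOY_NBUF);

	toy_set_page_dirty(&p);		/* write fault dirties the page up front */
	toy_mmap_store(&p, 0, 'b');
	toy_write_dirty_buffers(&p);	/* filesystem flushes buffers on its own */
	toy_mmap_store(&p, 1, 'c');	/* later store; page is still PG_dirty */

	need_io = use_fixed_ordering ? toy_clear_dirty_fixed(&p)
				     : toy_clear_dirty_old(&p);
	if (need_io)
		toy_write_dirty_buffers(&p);	/* writepage() */

	return memcmp(p.mem, p.disk, TOY_NBUF) == 0;
}
```

In this model toy_scenario(false) reports a mismatch between memory and disk, because writepage() finds no dirty buffers, and toy_scenario(true) reports a match.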
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 8:58 ` Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) Linus Torvalds @ 2006-12-29 10:48 ` Linus Torvalds 2006-12-29 11:16 ` Andrei Popa ` (5 more replies) 2006-12-29 15:27 ` Theodore Tso 1 sibling, 6 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-29 10:48 UTC (permalink / raw) To: Segher Boessenkool Cc: David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, Andrew Morton, a.p.zijlstra, tbm, arjan, andrei.popa On Fri, 29 Dec 2006, Linus Torvalds wrote: > > Hmm? I'd love it if somebody else wrote the patch and tested it, because > I'm getting sick and tired of this bug ;) Who the hell am I kidding? I haven't been able to sleep right for the last few days over this bug. It was really getting to me. And putting on the thinking cap, there's actually a fairly simple and nonintrusive patch. It still has a tiny tiny race (see the comment), but I bet nobody can really hit it in real life anyway, and I know several ways to fix it, so I'm not really _that_ worried about it. The patch is mostly a comment. The "real" meat of it is actually just a few lines. Can anybody get corruption with this thing applied? It goes on top of plain v2.6.20-rc2. Linus
----
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3a198c..ec01da1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 
-	if (!mapping)
-		return TestClearPageDirty(page);
-
-	if (TestClearPageDirty(page)) {
-		if (mapping_cap_account_dirty(mapping)) {
-			page_mkclean(page);
+	if (mapping && mapping_cap_account_dirty(mapping)) {
+		/*
+		 * Yes, Virginia, this is indeed insane.
+		 *
+		 * We use this sequence to make sure that
+		 *  (a) we account for dirty stats properly
+		 *  (b) we tell the low-level filesystem to
+		 *      mark the whole page dirty if it was
+		 *      dirty in a pagetable. Only to then
+		 *  (c) clean the page again and return 1 to
+		 *      cause the writeback.
+		 *
+		 * This way we avoid all nasty races with the
+		 * dirty bit in multiple places and clearing
+		 * them concurrently from different threads.
+		 *
+		 * Note! Normally the "set_page_dirty(page)"
+		 * has no effect on the actual dirty bit - since
+		 * that will already usually be set. But we
+		 * need the side effects, and it can help us
+		 * avoid races.
+		 *
+		 * We basically use the page "master dirty bit"
+		 * as a serialization point for all the different
+		 * threds doing their things.
+		 *
+		 * FIXME! We still have a race here: if somebody
+		 * adds the page back to the page tables in
+		 * between the "page_mkclean()" and the "TestClearPageDirty()",
+		 * we might have it mapped without the dirty bit set.
+		 */
+		if (page_mkclean(page))
+			set_page_dirty(page);
+		if (TestClearPageDirty(page)) {
 			dec_zone_page_state(page, NR_FILE_DIRTY);
+			return 1;
 		}
-		return 1;
+		return 0;
 	}
-	return 0;
+	return TestClearPageDirty(page);
 }
 EXPORT_SYMBOL(clear_page_dirty_for_io);
^ permalink raw reply related [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 10:48 ` Linus Torvalds @ 2006-12-29 11:16 ` Andrei Popa 2006-12-29 12:09 ` Nick Piggin ` (4 subsequent siblings) 5 siblings, 0 replies; 311+ messages in thread From: Andrei Popa @ 2006-12-29 11:16 UTC (permalink / raw) To: Linus Torvalds Cc: Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, Andrew Morton, a.p.zijlstra, tbm, arjan On Fri, 2006-12-29 at 02:48 -0800, Linus Torvalds wrote: > > On Fri, 29 Dec 2006, Linus Torvalds wrote: > > > > Hmm? I'd love it if somebody else wrote the patch and tested it, because > > I'm getting sick and tired of this bug ;) > > Who the hell am I kidding? I haven't been able to sleep right for the last > few days over this bug. It was really getting to me. > > And putting on the thinking cap, there's actually a fairly simple an > nonintrusive patch. It still has a tiny tiny race (see the comment), but I > bet nobody can really hit it in real life anyway, and I know several ways > to fix it, so I'm not really _that_ worried about it. > > The patch is mostly a comment. The "real" meat of it is actually just a > few lines. > > Can anybody get corruption with this thing applied? It goes on top of > plain v2.6.20-rc2. Tested with rtorrent and there is no corruption. > > Linus > > ---- > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index b3a198c..ec01da1 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page) > { > struct address_space *mapping = page_mapping(page); > > - if (!mapping) > - return TestClearPageDirty(page); > - > - if (TestClearPageDirty(page)) { > - if (mapping_cap_account_dirty(mapping)) { > - page_mkclean(page); > + if (mapping && mapping_cap_account_dirty(mapping)) { > + /* > + * Yes, Virginia, this is indeed insane. 
> + * > + * We use this sequence to make sure that > + * (a) we account for dirty stats properly > + * (b) we tell the low-level filesystem to > + * mark the whole page dirty if it was > + * dirty in a pagetable. Only to then > + * (c) clean the page again and return 1 to > + * cause the writeback. > + * > + * This way we avoid all nasty races with the > + * dirty bit in multiple places and clearing > + * them concurrently from different threads. > + * > + * Note! Normally the "set_page_dirty(page)" > + * has no effect on the actual dirty bit - since > + * that will already usually be set. But we > + * need the side effects, and it can help us > + * avoid races. > + * > + * We basically use the page "master dirty bit" > + * as a serialization point for all the different > + * threds doing their things. > + * > + * FIXME! We still have a race here: if somebody > + * adds the page back to the page tables in > + * between the "page_mkclean()" and the "TestClearPageDirty()", > + * we might have it mapped without the dirty bit set. > + */ > + if (page_mkclean(page)) > + set_page_dirty(page); > + if (TestClearPageDirty(page)) { > dec_zone_page_state(page, NR_FILE_DIRTY); > + return 1; > } > - return 1; > + return 0; > } > - return 0; > + return TestClearPageDirty(page); > } > EXPORT_SYMBOL(clear_page_dirty_for_io); > ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 10:48 ` Linus Torvalds 2006-12-29 11:16 ` Andrei Popa @ 2006-12-29 12:09 ` Nick Piggin 2006-12-29 17:25 ` Linus Torvalds 2006-12-29 12:31 ` Ingo Molnar ` (3 subsequent siblings) 5 siblings, 1 reply; 311+ messages in thread From: Nick Piggin @ 2006-12-29 12:09 UTC (permalink / raw) To: Linus Torvalds Cc: Segher Boessenkool, David Miller, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, Andrew Morton, a.p.zijlstra, tbm, arjan, andrei.popa Hey nice work Linus! Linus Torvalds wrote: > > On Fri, 29 Dec 2006, Linus Torvalds wrote: > >>Hmm? I'd love it if somebody else wrote the patch and tested it, because >>I'm getting sick and tired of this bug ;) > > > Who the hell am I kidding? I haven't been able to sleep right for the last > few days over this bug. It was really getting to me. > > And putting on the thinking cap, there's actually a fairly simple an > nonintrusive patch. Yeah *this* makes more sense. And in retrospect it was simple, we can't just throw out pte dirtiness information if the page doesn't have all buffers dirtied. > It still has a tiny tiny race (see the comment), but I > bet nobody can really hit it in real life anyway, and I know several ways > to fix it, so I'm not really _that_ worried about it. Well the race isn't a data loss one, is it? Just a case where the pte may be dirty but the page dirty state not accounted for. Can we fix it by just putting the page_mkclean back inside the TestClearPageDirty check, and re-clearing PG_dirty after redoing the set_page_dirty? > > The patch is mostly a comment. The "real" meat of it is actually just a > few lines. > > Can anybody get corruption with this thing applied? It goes on top of > plain v2.6.20-rc2. 
> > Linus > > ---- > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index b3a198c..ec01da1 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page) > { > struct address_space *mapping = page_mapping(page); > > - if (!mapping) > - return TestClearPageDirty(page); > - > - if (TestClearPageDirty(page)) { > - if (mapping_cap_account_dirty(mapping)) { > - page_mkclean(page); > + if (mapping && mapping_cap_account_dirty(mapping)) { > + /* > + * Yes, Virginia, this is indeed insane. > + * > + * We use this sequence to make sure that > + * (a) we account for dirty stats properly > + * (b) we tell the low-level filesystem to > + * mark the whole page dirty if it was > + * dirty in a pagetable. Only to then > + * (c) clean the page again and return 1 to > + * cause the writeback. > + * > + * This way we avoid all nasty races with the > + * dirty bit in multiple places and clearing > + * them concurrently from different threads. > + * > + * Note! Normally the "set_page_dirty(page)" > + * has no effect on the actual dirty bit - since > + * that will already usually be set. But we > + * need the side effects, and it can help us > + * avoid races. > + * > + * We basically use the page "master dirty bit" > + * as a serialization point for all the different > + * threds doing their things. > + * > + * FIXME! We still have a race here: if somebody > + * adds the page back to the page tables in > + * between the "page_mkclean()" and the "TestClearPageDirty()", > + * we might have it mapped without the dirty bit set. > + */ > + if (page_mkclean(page)) > + set_page_dirty(page); > + if (TestClearPageDirty(page)) { > dec_zone_page_state(page, NR_FILE_DIRTY); > + return 1; > } > - return 1; > + return 0; > } > - return 0; > + return TestClearPageDirty(page); > } > EXPORT_SYMBOL(clear_page_dirty_for_io); > > -- SUSE Labs, Novell Inc. 
^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 12:09 ` Nick Piggin @ 2006-12-29 17:25 ` Linus Torvalds 0 siblings, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-29 17:25 UTC (permalink / raw) To: Nick Piggin Cc: Segher Boessenkool, David Miller, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, Andrew Morton, a.p.zijlstra, tbm, arjan, andrei.popa On Fri, 29 Dec 2006, Nick Piggin wrote: > > > It still has a tiny tiny race (see the comment), but I bet nobody can really > > hit it in real life anyway, and I know several ways to fix it, so I'm not > > really _that_ worried about it. > > Well the race isn't a data loss one, is it? Just a case where the > pte may be dirty but the page dirty state not accounted for. Right. We should be picking it up eventually, since it's still in the page tables, but if we've lost sight of the page dirtiness we won't react correctly to msync() and/or fdatasync(). So we don't _lose_ the data, we just might not write it out in a timely manner if we ever hit the race. > Can we fix it by just putting the page_mkclean back inside the > TestClearPageDirty check, and re-clearing PG_dirty after redoing > the set_page_dirty? I considered it, but quite frankly, if we did it that way, I'd really like to just fix the whole insane "set_page_dirty()" instead. I think set_page_dirty() should be split up. One thing that confused me mentally was that almost all of the dirty handling was actually done only if PG_dirty wasn't already set, so the _bulk_ of set_page_dirty() really ends up being if (!TestSetPageDirty(page)) { .. we just marked the page dirty, it was clean before, so we need to add it to the queues etc .. } and that's the part that I (and probably others) always really thought about. But then we have the _one_ thing that runs outside of that "do only once per dirty bit" logic, and it's the buffer dirtying.
If we had had two separate operations for this all: "set_dirty_every_time()" and the regular "set_dirty()", I don't think this would have been nearly as confusing. (And then the difference between "__set_page_dirty_nobuffers()" and "__set_page_dirty_buffers()" really boils down to one doing the "everytime" _and_ the "once per dirty" checks and the other one doing just the "once per dirty bit" act - and we could rename the damn things to something saner too). If we split it up that way, then the whole clear_page_dirty_for_io() logic would boil down to if (TestClearPageDirty(page)) { if (page_mkclean(page)) set_dirty_every_time(); return 1; } return 0; and we wouldn't even need to do any of the "clear dirty again" kind of idiocy, because the "set_dirty_every_time()" stuff is the one that doesn't even care about the state of the PG_dirty bit - it's done regardless, and doesn't really touch it. That's what I wanted to do, but with the current "set_page_dirty()" setup, I think my patch makes reasonable sense. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
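The split suggested above can be sketched in user space as follows. This is only an illustration under stated assumptions: the toy_* names and the global counter are hypothetical, "set_dirty_every_time" is modelled as nothing more than dirtying all buffers, and the once-per-dirty-bit work is reduced to bumping a counter that stands in for NR_FILE_DIRTY.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NBUF 4	/* buffers per page in this toy model */

/* Hypothetical page model; not a kernel structure. */
struct toy_page {
	bool pg_dirty;			/* PG_dirty stand-in */
	bool pte_dirty;			/* dirty bit of the single pte */
	bool bh_dirty[TOY_NBUF];	/* per-buffer dirty bits */
};

static int toy_nr_file_dirty;	/* stand-in for the NR_FILE_DIRTY counter */

/* The part that runs on every call: buffer dirtying, PG_dirty untouched. */
static void toy_set_dirty_every_time(struct toy_page *p)
{
	for (int i = 0; i < TOY_NBUF; i++)
		p->bh_dirty[i] = true;
}

/* The part that runs once per clean-to-dirty transition: accounting. */
static void toy_set_dirty_once(struct toy_page *p)
{
	if (!p->pg_dirty) {
		p->pg_dirty = true;
		toy_nr_file_dirty++;
	}
}

/* What a buffer-backed set_page_dirty() would be: both halves. */
static void toy_set_page_dirty(struct toy_page *p)
{
	toy_set_dirty_every_time(p);
	toy_set_dirty_once(p);
}

/* Cleans the pte and reports whether it was dirty. */
static bool toy_page_mkclean(struct toy_page *p)
{
	bool was_dirty = p->pte_dirty;

	p->pte_dirty = false;
	return was_dirty;
}

/* The simplified clear_page_dirty_for_io() shape from the text above. */
static bool toy_clear_page_dirty_for_io(struct toy_page *p)
{
	if (!p->pg_dirty)
		return false;

	p->pg_dirty = false;
	assert(toy_nr_file_dirty > 0);
	toy_nr_file_dirty--;
	if (toy_page_mkclean(p))
		toy_set_dirty_every_time(p);	/* no need to re-clear PG_dirty */
	return true;
}

/* Helper for inspecting the model: how many buffers are dirty right now. */
static int toy_count_dirty_buffers(const struct toy_page *p)
{
	int n = 0;

	for (int i = 0; i < TOY_NBUF; i++)
		n += p->bh_dirty[i] ? 1 : 0;
	return n;
}
```

Because the every-time half never touches PG_dirty, the writeback path does not have to set the bit and clear it again, which is the simplification being argued for.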
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 10:48 ` Linus Torvalds 2006-12-29 11:16 ` Andrei Popa 2006-12-29 12:09 ` Nick Piggin @ 2006-12-29 12:31 ` Ingo Molnar 2006-12-29 13:08 ` Martin Johansson ` (2 subsequent siblings) 5 siblings, 0 replies; 311+ messages in thread From: Ingo Molnar @ 2006-12-29 12:31 UTC (permalink / raw) To: Linus Torvalds Cc: Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, Andrew Morton, a.p.zijlstra, tbm, arjan, andrei.popa * Linus Torvalds <torvalds@osdl.org> wrote: > > Hmm? I'd love it if somebody else wrote the patch and tested it, > > because I'm getting sick and tired of this bug ;) > > Who the hell am I kidding? I haven't been able to sleep right for the > last few days over this bug. It was really getting to me. > > And putting on the thinking cap, there's actually a fairly simple an > nonintrusive patch. [...] ok, your patch seems to fix the testcase here too on -rc2-rt. [ Damn, i should have slept a bit more, that would have saved me a ~4 hour debug and tracing session today to analyze your testcase, just to find your patch and your explanation on lkml, right after i sent my analysis and workaround patch ;-) At least now we know it from two independent tracing results that the suspect code is the same. ] Ingo ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 10:48 ` Linus Torvalds ` (2 preceding siblings ...) 2006-12-29 12:31 ` Ingo Molnar @ 2006-12-29 13:08 ` Martin Johansson 2006-12-29 14:08 ` Martin Michlmayr 2006-12-29 22:16 ` Andrew Morton 5 siblings, 0 replies; 311+ messages in thread From: Martin Johansson @ 2006-12-29 13:08 UTC (permalink / raw) To: Linus Torvalds Cc: Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, Andrew Morton, a.p.zijlstra, tbm, arjan, andrei.popa Linus Torvalds wrote: >[...] > The patch is mostly a comment. The "real" meat of it is actually just a > few lines. > > Can anybody get corruption with this thing applied? It goes on top of > plain v2.6.20-rc2. No corruption with the testcase here. Will check with rtorrent too later today but I suppose it will work just fine. Nice work! It has been interesting (and educating) to follow this bug-hunt :) /Martin ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 10:48 ` Linus Torvalds ` (3 preceding siblings ...) 2006-12-29 13:08 ` Martin Johansson @ 2006-12-29 14:08 ` Martin Michlmayr 2006-12-29 15:17 ` Stephen Clark 2006-12-29 22:16 ` Andrew Morton 5 siblings, 1 reply; 311+ messages in thread From: Martin Michlmayr @ 2006-12-29 14:08 UTC (permalink / raw) To: Linus Torvalds Cc: Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, Andrew Morton, a.p.zijlstra, arjan, andrei.popa * Linus Torvalds <torvalds@osdl.org> [2006-12-29 02:48]: > Can anybody get corruption with this thing applied? It goes on top > of plain v2.6.20-rc2. It works for me now, both your testcase as well as an installation of Debian on this ARM device. I manually applied the patch to 2.6.19. Thanks. -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 14:08 ` Martin Michlmayr @ 2006-12-29 15:17 ` Stephen Clark 2006-12-29 15:54 ` Martin Michlmayr 0 siblings, 1 reply; 311+ messages in thread From: Stephen Clark @ 2006-12-29 15:17 UTC (permalink / raw) To: Martin Michlmayr Cc: Linus Torvalds, Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, Andrew Morton, a.p.zijlstra, arjan, andrei.popa Martin Michlmayr wrote: >* Linus Torvalds <torvalds@osdl.org> [2006-12-29 02:48]: > > >>Can anybody get corruption with this thing applied? It goes on top >>of plain v2.6.20-rc2. >> >> > >It works for me now, both your testcase as well as an installation of >Debian on this ARM device. I manually applied the patch to 2.6.19. > >Thanks. > > Hi Martin, Can you post a diff against 2.6.19? Thanks, Steve -- "They that give up essential liberty to obtain temporary safety, deserve neither liberty nor safety." (Ben Franklin) "The course of history shows that as a government grows, liberty decreases." (Thomas Jefferson) ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 15:17 ` Stephen Clark @ 2006-12-29 15:54 ` Martin Michlmayr 0 siblings, 0 replies; 311+ messages in thread From: Martin Michlmayr @ 2006-12-29 15:54 UTC (permalink / raw) To: Stephen Clark Cc: Linus Torvalds, Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, Andrew Morton, a.p.zijlstra, arjan, andrei.popa * Stephen Clark <Stephen.Clark@seclark.us> [2006-12-29 10:17]: > >It works for me now, both your testcase as well as an installation of > >Debian on this ARM device. I manually applied the patch to 2.6.19. > > Can you post a diff against 2.6.19? --- a/mm/page-writeback.c 2006-11-29 21:57:37.000000000 +0000 +++ b/mm/page-writeback.c 2006-12-29 11:02:55.555147896 +0000 @@ -893,16 +893,45 @@ { struct address_space *mapping = page_mapping(page); - if (mapping) { + if (mapping && mapping_cap_account_dirty(mapping)) { + /* + * Yes, Virginia, this is indeed insane. + * + * We use this sequence to make sure that + * (a) we account for dirty stats properly + * (b) we tell the low-level filesystem to + * mark the whole page dirty if it was + * dirty in a pagetable. Only to then + * (c) clean the page again and return 1 to + * cause the writeback. + * + * This way we avoid all nasty races with the + * dirty bit in multiple places and clearing + * them concurrently from different threads. + * + * Note! Normally the "set_page_dirty(page)" + * has no effect on the actual dirty bit - since + * that will already usually be set. But we + * need the side effects, and it can help us + * avoid races. + * + * We basically use the page "master dirty bit" + * as a serialization point for all the different + * threds doing their things. + * + * FIXME! 
We still have a race here: if somebody + * adds the page back to the page tables in + * between the "page_mkclean()" and the "TestClearPageDirty()", + * we might have it mapped without the dirty bit set. + */ + if (page_mkclean(page)) + set_page_dirty(page); if (TestClearPageDirty(page)) { - if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); - dec_zone_page_state(page, NR_FILE_DIRTY); - } + dec_zone_page_state(page, NR_FILE_DIRTY); return 1; } return 0; - } + } return TestClearPageDirty(page); } EXPORT_SYMBOL(clear_page_dirty_for_io); -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 10:48 ` Linus Torvalds ` (4 preceding siblings ...) 2006-12-29 14:08 ` Martin Michlmayr @ 2006-12-29 22:16 ` Andrew Morton 2006-12-29 22:24 ` Andrew Morton 2006-12-29 22:42 ` Linus Torvalds 5 siblings, 2 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-29 22:16 UTC (permalink / raw) To: Linus Torvalds Cc: Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, a.p.zijlstra, tbm, arjan, andrei.popa On Fri, 29 Dec 2006 02:48:35 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > + if (mapping && mapping_cap_account_dirty(mapping)) { > + /* > + * Yes, Virginia, this is indeed insane. > + * > + * We use this sequence to make sure that > + * (a) we account for dirty stats properly > + * (b) we tell the low-level filesystem to > + * mark the whole page dirty if it was > + * dirty in a pagetable. Only to then > + * (c) clean the page again and return 1 to > + * cause the writeback. > + * > + * This way we avoid all nasty races with the > + * dirty bit in multiple places and clearing > + * them concurrently from different threads. > + * > + * Note! Normally the "set_page_dirty(page)" > + * has no effect on the actual dirty bit - since > + * that will already usually be set. But we > + * need the side effects, and it can help us > + * avoid races. > + * > + * We basically use the page "master dirty bit" > + * as a serialization point for all the different > + * threds doing their things. > + * > + * FIXME! We still have a race here: if somebody > + * adds the page back to the page tables in > + * between the "page_mkclean()" and the "TestClearPageDirty()", > + * we might have it mapped without the dirty bit set. 
> + */ > + if (page_mkclean(page)) > + set_page_dirty(page); > + if (TestClearPageDirty(page)) { > dec_zone_page_state(page, NR_FILE_DIRTY); > + return 1; > } - Presumably reiser3's ordered-data mode has the same problem. And ext4, of course. Dunno about other filesystems. - The above change means that we do extra writeout. If a page is dirtied once, kjournald will write it and then pdflush will come along and needlessly write it again. But otoh, if a mapping is being repeatedly dirtied, kjournald will write the page once per 30 seconds (dirty_expire_centisecs) and pdflush will write the page once per 30 seconds as well. But we _should_ be writing it once per five seconds (kjournald commit interval). So we're still ahead ;) - Poor old IO accounting broke again. - People were saying that ext2 and ext3,data=writeback were also showing corruption. What's up with that? - For a long time I've wanted to nuke the current ext3/jbd ordered-data implementation altogether, and just make kjournald call into the standard writeback code to do a standard superblock->inodes->pages walk. I think it'd be fairly straightforward to do. We'd need to teach the writeback code to be able to skip dirty pages which don't have a disk mapping, so that kjournald doesn't end up waiting for kjournald to free up journal space. Would need to avoid possible deadlocks where someone calls ext3_force_commit() or otherwise does a synchronous commit while holding VFS locks. reiser3 and ext4 could be converted too. Not a short-term project, but this would avoid the problem. - It's pretty obnoxious that the VM now sets a clean page "dirty" and then proceeds to modify its contents. It would be nice to stop doing that. We could stop marking the page dirty in do_wp_page() and create a new VM counter "NR_PTE_DIRTY", which means "number of mapping_cap_account_dirty() pages which have a dirty pte pointing at them". Or, perhaps "number of dirty ptes which point at mapping_cap_account_dirty() pages".
Which can be larger, but the writeout code will probably cope. Then we take NR_PTE_DIRTY into account in the dirty-page balancing act. So - do_wp_page() will still run balance_dirty_pages() - but it would no longer run set_page_dirty(). - But it needs to run mark_inode_dirty() so the fs-writeback code notices the file. - And mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) becomes insufficient. The tricky part here is "how do we do the writeback"? The pte-dirty,!PageDirty pages aren't tagged as dirty in the radix-tree and writeback needs to find them so that it can effectively do an msync() on them. Walking all the mm's and vma's would be insane. Visiting all the pages in the file would also probably be insane. Perhaps this can be solved by adding a new radix-tree tag which means "this page might have dirty ptes pointing at it". For each file writeback would do a radix-tree walk of these pages, cleaning-and-write-protecting ptes, marking the corresponding pages dirty and clearing their PAGECACHE_TAG_PTE_DIRTY tags. Then we can fix the mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) problem by doing mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) || mapping_tagged(mapping, PAGECACHE_TAG_PTE_DIRTY) or, better, mapping_tagged(mapping, (1<<PAGECACHE_TAG_DIRTY)|(1<<PAGECACHE_TAG_PTE_DIRTY)) perhaps. The msync() code would need to be taught to call the PAGECACHE_TAG_PTE_DIRTY walker for the appropriate page range. This is also not a quick-fix. ^ permalink raw reply [flat|nested] 311+ messages in thread
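A rough user-space sketch of the tag idea above. It makes several assumptions: the "pte might be dirty" tag is only a proposal in this thread, a flat array stands in for the radix tree, each page has one pte, and all toy_* and TOY_* names are hypothetical. The lookup takes a mask of tag bits, mirroring the "or, better" variant suggested above rather than any existing kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tag numbers, used as bit positions in a per-page mask. */
enum { TOY_TAG_DIRTY = 0, TOY_TAG_PTE_DIRTY = 1 };

struct toy_page {
	bool pte_dirty;		/* dirty bit of the single pte */
	unsigned int tags;	/* bitmask of the TOY_TAG_* bits set on this page */
};

/* A flat array stands in for the per-file radix tree. */
struct toy_mapping {
	struct toy_page *pages;
	size_t nr_pages;
};

/* True if any page in the mapping carries at least one tag in the mask. */
static bool toy_mapping_tagged_any(const struct toy_mapping *m, unsigned int mask)
{
	for (size_t i = 0; i < m->nr_pages; i++)
		if (m->pages[i].tags & mask)
			return true;
	return false;
}

/* Write fault: no set_page_dirty(), only note that a pte may be dirty. */
static void toy_write_fault(struct toy_page *p)
{
	p->pte_dirty = true;
	p->tags |= 1u << TOY_TAG_PTE_DIRTY;
}

/*
 * Writeback or msync() pre-pass over [start, end): clean the pte of each
 * tagged page, move it to the ordinary dirty tag, and drop the pte tag.
 * Returns the number of pages that were found dirty through their pte.
 */
static size_t toy_collect_pte_dirty(struct toy_mapping *m, size_t start, size_t end)
{
	size_t moved = 0;

	assert(start <= end);
	for (size_t i = start; i < end && i < m->nr_pages; i++) {
		struct toy_page *p = &m->pages[i];

		if (!(p->tags & (1u << TOY_TAG_PTE_DIRTY)))
			continue;
		p->tags &= ~(1u << TOY_TAG_PTE_DIRTY);
		if (p->pte_dirty) {
			p->pte_dirty = false;
			p->tags |= 1u << TOY_TAG_DIRTY;
			moved++;
		}
	}
	return moved;
}
```

After a write fault the mapping is invisible to a check on the dirty tag alone, which is why the combined mask is needed before deciding whether a file has anything to write back.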
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 22:16 ` Andrew Morton @ 2006-12-29 22:24 ` Andrew Morton 2006-12-29 22:42 ` Linus Torvalds 1 sibling, 0 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-29 22:24 UTC (permalink / raw) To: Linus Torvalds, Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, a.p.zijlstra, tbm, arjan, andrei.popa On Fri, 29 Dec 2006 14:16:32 -0800 Andrew Morton <akpm@osdl.org> wrote: > - Poor old IO accounting broke again. No it didn't - we're relying upon the behaviour of __set_page_dirty_buffers() against an already-dirty page. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 22:16 ` Andrew Morton 2006-12-29 22:24 ` Andrew Morton @ 2006-12-29 22:42 ` Linus Torvalds 2006-12-29 23:32 ` Theodore Tso 2006-12-29 23:51 ` Andrew Morton 1 sibling, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-29 22:42 UTC (permalink / raw) To: Andrew Morton Cc: Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, a.p.zijlstra, tbm, arjan, andrei.popa On Fri, 29 Dec 2006, Andrew Morton wrote: > > - The above change means that we do extra writeout. If a page is dirtied > once, kjournald will write it and then pdflush will come along and > needlessly write it again. There's zero extra writeout for any flushing that flushes BY PAGES. Only broken flushers that flush by buffer heads (which really really really shouldn't be done any more: welcome to the 21st century) will cause extra writeouts. And those extra writeouts are obviously required for all the dirty state to actually hit the disk - which is the point of the patch. So they're not "extra" - they are "required for correct working". But I can't stress the fact enough that people SHOULD NOT do writeback by buffer heads. The buffer head has been purely an "IO entity" for the last several years now, and it's not a cache entity. Anybody who does writeback by buffer heads is basically bypassing the real cache (the page cache), and that's why all the problems happen. I think ext3 is terminally crap by now. It still uses buffer heads in places where it really really shouldn't, and as a result, things like directory accesses are simply slower than they should be. Sadly, I don't think ext4 is going to fix any of this, either. It's all just too inherently wrongly designed around the buffer head (which was correct in 1995, but hasn't been correct for a long time in the kernel any more). > - Poor old IO accounting broke again. No. 
That's why I used "set_page_dirty()" and did it that strange ugly way ("set page dirty, even though it's already dirty, and even though the very next thing we will do is TestClearPageDirty???"). That code looks strange as a result, which is why it now has more comments on it than actual code ;) > - People were saying that ext2 and ext3,data=writeback were also showing > corruption. What's up with that? I thought the "ext3,data=writeback" case was reported to be fine by several people? I'm not sure about ext2. I didn't look at what it did based on buffer heads. I would have expected it to be ok. That said, at least one report was later shown to be bogus (errors due to out of disk, not due to actual errors ;). > - For a long time I've wanted to nuke the current ext3/jbd ordered-data > implementation altogether, and just make kjournald call into the > standard superblock->inodes->pages walk. I really would like to see less of the buffer-head-based stuff, and yes, more of the normal inode page walking. I don't think you can "order" accesses within a page anyway, exactly because of memory mapping issues, so any page ordering is not about buffer heads on the page itself, it should be purely about metadata. > - It's pretty obnoxious that the VM now sets a clean page "dirty" and > then proceeds to modify its contents. It would be nice to stop doing > that. No. I think this really is the fundamental confusion people had. People thought that setting the page dirty meant that it was no longer being modified. It hasn't meant that in a LONG time - ever since the whole DIRTY_TAG thing, the most important part of the PG_dirty thing has really been that it's now efficiently findable by the writeout logic. And that is very much what the whole page accounting _depends_ on. When we mmap a page, we need to mark it "findable" as dirty _before_ people actually start writing to it, because it's too late afterwards.
> We could stop marking the page dirty in do_wp_page() and create a new > VM counter "NR_PTE_DIRTY", which means > > "number of mapping_cap_account_dirty() pages which have a dirty pte > pointing at them". Well, then you need to change what PAGE_MAPPING_TAG_DIRTY means too. That's very fundamental. That DIRTY _tag_ is now even more important than the PG_dirty bit itself, since that's what we actually use to _access_ those things. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 22:42 ` Linus Torvalds @ 2006-12-29 23:32 ` Theodore Tso 2006-12-29 23:59 ` Linus Torvalds 2006-12-30 0:05 ` Andrew Morton 2006-12-29 23:51 ` Andrew Morton 1 sibling, 2 replies; 311+ messages in thread From: Theodore Tso @ 2006-12-29 23:32 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, a.p.zijlstra, tbm, arjan, andrei.popa, linux-ext4 On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote: > I think ext3 is terminally crap by now. It still uses buffer heads in > places where it really really shouldn't, and as a result, things like > directory accesses are simply slower than they should be. Sadly, I don't > think ext4 is going to fix any of this, either. Not just ext3; ocfs2 is using the jbd layer as well. I think we're going to have to put this (a rework of jbd2 to use the page cache) on the ext4 todo list, and work with the ocfs2 folks to try to come up with something that suits their needs as well. Fortunately we have this filesystem/storage summit thing coming up in the next few months, and we can try to get some discussion going on the linux-ext4 mailing list in the meantime. Unfortunately, I don't think this is going to be trivial. If we do get this fixed for ext4, one interesting question is whether people would accept a patch to backport the fixes to ext3, given the grief this is causing the page I/O and VM routines. OTOH, reiser3 probably has the same problems, and I suspect the changes to ext3 to cause it to avoid buffer heads, especially in order to support filesystem blocksizes < pagesize, are going to be sufficiently risky in terms of introducing regressions to ext3 that they would probably be rejected on those grounds.
So unfortunately, we probably are going to have to support flushes via buffer heads for the foreseeable future. - Ted ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 23:32 ` Theodore Tso @ 2006-12-29 23:59 ` Linus Torvalds 2006-12-30 0:05 ` Andrew Morton 1 sibling, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-29 23:59 UTC (permalink / raw) To: Theodore Tso Cc: Andrew Morton, Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, a.p.zijlstra, tbm, arjan, andrei.popa, linux-ext4 On Fri, 29 Dec 2006, Theodore Tso wrote: > > If we do get this fixed for ext4, one interesting question is whether > people would accept a patch to backport the fixes to ext3, given the > grief this is causing the page I/O and VM routines. I don't think backporting is the smartest option (unless it's done _way_ later), but the real problem with it isn't actually the VM behaviour, but simply the fact that cached performance absolutely _sucks_ with the buffer cache. With the physically indexed buffer cache thing, you end up always having to do these complicated translations into block numbers for every single access, and at some point when I benchmarked it, it was a huge overhead for doing simple things like readdir. It's also a major pain for read-ahead, exactly partly due to the high cost of translation - because you can't cheaply check whether the next block is there, the cost of even asking the question "should I try to read ahead?" is much much higher. As a result, read-ahead is seriously limited, because it's so expensive for the cached case (which is still hopefully the _common_ case). So because read-ahead is limited, the non-cached case then _really_ sucks. It was somewhat fixed in a really god-awful fashion by having ext3_readdir() actually do _readahead_ through the page cache, even though it does everything else through the buffer cache.
And that just happens to work because we hopefully have physically contiguous blocks, but when that isn't true, the readahead doesn't do squat. It's really quite fundamentally broken. But none of that causes any problems for the VM, since directories cannot be mmap'ed anyway. But it's really pitiful, and it really doesn't work very well. Of course, other filesystems _also_ suck at this, and other operating systems have even MORE problems, so people don't always seem to realize how horribly horribly broken this all is. I really wish somebody would write a filesystem that did large cold-cache directories well. Open some horrible file manager on /usr/bin with cold caches, and weep. The biggest problem is the inode indirection, but at some point when I looked at why it sucked, it was doing basically synchronous single-buffer reads on the directory too, because readahead didn't work properly. I was hoping that something like SpadFS would actually take off, because it seemed to do a lot of good design choices (having inodes in-line in the directory for when there are no hardlinks is probably a requirement for a good filesystem these days. The separate inode table had its uses, but indirection in a filesystem really does suck, and stat information is too important to be indirect unless it absolutely has to). But I suspect it needs more than somebody who just wants to get his thesis written ;) Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 23:32 ` Theodore Tso 2006-12-29 23:59 ` Linus Torvalds @ 2006-12-30 0:05 ` Andrew Morton 2006-12-30 0:50 ` Linus Torvalds 1 sibling, 1 reply; 311+ messages in thread From: Andrew Morton @ 2006-12-30 0:05 UTC (permalink / raw) To: Theodore Tso Cc: Linus Torvalds, Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, a.p.zijlstra, tbm, arjan, andrei.popa, linux-ext4 On Fri, 29 Dec 2006 18:32:07 -0500 Theodore Tso <tytso@mit.edu> wrote: > On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote: > > I think ext3 is terminally crap by now. It still uses buffer heads in > > places where it really really shouldn't, and as a result, things like > > directory accesses are simply slower than they should be. Sadly, I don't > > think ext4 is going to fix any of this, either. > > Not just ext3; ocfs2 is using the jbd layer as well. I think we're > going to have to put this (a rework of jbd2 to use the page cache) on > the ext4 todo list, and work with the ocfs2 folks to try to come up > with something that suits their needs as well. Fortunately we have > this filesystem/storage summit thing coming up in the next few months, > and we can try to get some discussion going on the linux-ext4 mailing > list in the meantime. Unfortunately, I don't think this is going to > be trivial. I suspect it would be insane to move any part of JBD (apart from the ordered-data flush) to use pagecache. The whole thing is fundamentally block-based. But only for metadata - there's no strong reason why ext3/4 needs to manipulate file data via buffer_heads if data=journal and chattr +j aren't in use. We could possibly move ext3/4 directories out of the blockdev pagecache and into per-directory pagecache, but that wouldn't change anything - the journalling would still be block-based. 
Adam Richter spent considerable time a few years ago trying to make the mpage code go direct-to-BIO in all cases and we eventually gave up. The conceptual layering of page<->blocks<->bio is pretty clean, and it is hard and ugly to fully optimise away the "block" bit in the middle. buffer_heads become more important with large PAGE_CACHE_SIZE. I'd expect nobh mode to be quite inefficient with some workloads on 64k pages. We need that representation of the state (and location) of the block-sized hunks which make up the page. > If we do get this fixed for ext4, one interesting question is whether > people would accept a patch to backport the fixes to ext3, given the > the grief this is causing the page I/O and VM routines. OTOH, reiser3 > probably has the same problems, and I suspect the changes to ext3 to > cause it to avoid buffer heads, especially in order to support for > filesystem blocksizes < pagesize, are going to be sufficiently risky > in terms of introducing regressions to ext3 that they would probably > be rejected on those grounds. So unfortunately, we probably are going > to have to support flushes via buffer heads for the foreseeable > future. We'll see. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-30 0:05 ` Andrew Morton @ 2006-12-30 0:50 ` Linus Torvalds 0 siblings, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-30 0:50 UTC (permalink / raw) To: Andrew Morton Cc: Theodore Tso, Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, a.p.zijlstra, tbm, arjan, andrei.popa, linux-ext4 On Fri, 29 Dec 2006, Andrew Morton wrote: > > Adam Richter spent considerable time a few years ago trying to make the > mpage code go direct-to-BIO in all cases and we eventually gave up. The > conceptual layering of page<->blocks<->bio is pretty clean, and it is hard > and ugly to fully optimise away the "block" bit in the middle. Using the buffer cache as a translation layer to the physical address is fine. That's what _any_ block device will do. I'm not at all saying that "buffer heads must go away". They work fine. What I'm saying is that - if you index by buffer heads, you're screwed. - if you do IO by starting at buffer heads, you're screwed. Both indexing and writeback decisions should be done at the page cache layer. Then, when you actually need to do IO, you look at the buffers. But you start from the "page". YOU SHOULD NEVER LOOK UP a buffer on its own merits, and YOU SHOULD NEVER DO IO on a buffer head on its own cognizance. So by all means keep the buffer heads as a way to keep the "virtual->physical" translation. It's what they were designed for. But they were _originally_ also designed for "lookup" and "driving the start of IO", and that is wrong, and has been wrong for a long time now, because - lookup based on physical address is fundamentally slow and inefficient. You have to look up the virtual->physical translation somewhere else, so it's by design an unnecessary indirection _and_ that "somewhere else" is also by definition filesystem-specific, so you can't do any of these things at the VFS layer.
Ergo: anything that needs to look up the physical address in order to find the buffer head is BROKEN in this day and age. We look up the _virtual_ page cache page, and then we can trivially find the buffer heads within that page thanks to page->buffers. Example: ext2 vs ext3 readdir. One of them sucks, the other doesn't. - starting IO based on the physical entity is insane. It's insane exactly _because_ the VM doesn't actually think in physical addresses, or in buffer-sized blocks. The VM only really knows about whole pages, and all the VM decisions fundamentally have to be page-based. We don't ever "free a buffer". We free a whole page, and as such, doing writeback based on buffers is pointless, because it doesn't actually say anything about the "page state" which is what the VM tracks. But neither of these means that "buffer_head" itself has to go away. They both really boil down to the same thing: you should never KEY things by the buffer head. All actions should be based on virtual indexes as far as at all humanly possible. Once you do lookup and locking and writeback _starting_ from the page, it's then easy to look up the actual buffer head within the page, and use that as a way to do the actual _IO_ on the physical address. So the buffer heads still exist in ext2, for example, but they don't drive the show quite as much. (They still do in some areas: the allocation bitmaps, the xattr code etc. But as long as none of those have big VM footprints, and as long as no _common_ operations really care deeply, and as long as those data structures never need to be touched by the VM or VFS layer, nobody will ever really care). The directory case comes up just because "readdir()" actually is very common, and sometimes very slow. 
And it can have a big VM working set footprint ("find"), so trying to be page-based actually really helps, because it all drives things like writeback on the _right_ issues, and we can do things like LRU's and writeback decisions on the level that really matters. I actually suspect that the inode tables could benefit from being in the page cache too (although I think that the inode buffer address is actually "physical", so there's no indirection for inode tables, which means that the virtual vs physical addressing doesn't matter). For directories, there definitely is a big cost to continually doing the virtual->physical translation all the time. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
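The "start from the page, then look at the buffers" rule argued in the message above can be sketched as a toy userspace model. All toy_* names and sizes are hypothetical and this is not kernel code: the lookup is keyed by the virtual page index within the file, and the per-page buffers are consulted only for the cached virtual->physical translation once IO is actually needed.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BUFFERS_PER_PAGE 4   /* illustrative: 1k blocks on a 4k page */
#define TOY_NR_PAGES         8   /* illustrative file size in pages */

/* Cached translation for one block-sized piece of a page. */
struct toy_buffer {
    unsigned long blocknr;   /* physical block number on disk */
    int mapped;              /* nonzero once the translation is known */
};

struct toy_page {
    int present;             /* nonzero if the page is in the cache */
    struct toy_buffer buffers[TOY_BUFFERS_PER_PAGE];
};

/* Toy per-file page cache, indexed by page index within the file. */
struct toy_file {
    struct toy_page pages[TOY_NR_PAGES];
};

/* Record a translation; the page becomes present as a side effect. */
static void toy_map_buffer(struct toy_file *f, unsigned long index,
                           unsigned int n, unsigned long blocknr)
{
    assert(index < TOY_NR_PAGES && n < TOY_BUFFERS_PER_PAGE);
    f->pages[index].present = 1;
    f->pages[index].buffers[n].blocknr = blocknr;
    f->pages[index].buffers[n].mapped = 1;
}

/* Virtual lookup: no filesystem-specific translation is needed. */
static struct toy_page *toy_find_page(struct toy_file *f, unsigned long index)
{
    if (index >= TOY_NR_PAGES || !f->pages[index].present)
        return NULL;
    return &f->pages[index];
}

/* Only at IO time: go page -> buffer -> physical block.
 * Returns -1 if the page or the translation is not cached. */
static long toy_block_for_io(struct toy_file *f, unsigned long index,
                             unsigned int n)
{
    struct toy_page *page = toy_find_page(f, index);

    if (page == NULL || n >= TOY_BUFFERS_PER_PAGE || !page->buffers[n].mapped)
        return -1;
    return (long)page->buffers[n].blocknr;
}
```

In this model, asking "is the next page cached?" is a plain index check even when the blocks behind it are scattered on disk, which is the read-ahead argument made earlier in the thread.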
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 22:42 ` Linus Torvalds 2006-12-29 23:32 ` Theodore Tso @ 2006-12-29 23:51 ` Andrew Morton 2006-12-30 0:11 ` Linus Torvalds 1 sibling, 1 reply; 311+ messages in thread From: Andrew Morton @ 2006-12-29 23:51 UTC (permalink / raw) To: Linus Torvalds Cc: Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, a.p.zijlstra, tbm, arjan, andrei.popa On Fri, 29 Dec 2006 14:42:51 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > > > On Fri, 29 Dec 2006, Andrew Morton wrote: > > > > - The above change means that we do extra writeout. If a page is dirtied > > once, kjournald will write it and then pdflush will come along and > > needlessly write it again. > > There's zero extra writeout for any flushing that flushes BY PAGES. > > Only broken flushers that flush by buffer heads (which really really > really shouldn't be done any more: welcome to the 21st century) will cause > extra writeouts. And those extra writeouts are obviously required for all > the dirty state to actually hit the disk - which is the point of the > patch. > > So they're not "extra" - they are "required for correct working". They're extra. As in "can be optimised away". > But I can't stress the fact enough that people SHOULD NOT do writeback by > buffer heads. The buffer head has been purely an "IO entity" for the last > several years now, and it's not a cache entity. The buffer_head is not an IO container. It is the kernel's core representation of a disk block. Usually (but not always) it is backed by some memory which is in pagecache. We can feed buffer_heads into IO containers via submit_bh(), but that's far from the only thing we use buffer_heads for. We should have done s/buffer_head/block/g years ago. 
JBD implements physical block-based journalling, so it is 100% appropriate that JBD deal with these disk blocks using their buffer_head representation. That being said, ordered-data mode isn't really part of the JBD journalling system at all (the data doesn't get journalled!) - ordered-mode is an add-on to the JBD journal to make the metadata which we're about to journal point at more-likely-to-be-correct data. JBD's ordered-mode writeback is just a sync and I see no conceptual problems with killing its old buffer_head based sync and moving it into the 21st century. > Anybody who does writeback > by buffer heads is basically bypassing the real cache (the page cache), > and that's why all the problems happen. > > I think ext3 is terminally crap by now. It still uses buffer heads in > places where it really really shouldn't, The ordered-data mode flush: sure. The rest of JBD's use of buffer_heads is quite appropriate. > and as a result, things like > directory accesses are simply slower than they should be. Sadly, I don't > think ext4 is going to fix any of this, either. I thought I fixed the performance problem? Somewhat nastily, but as ext3 directories are metadata it is appropriate that modifications to them be done in terms of buffer_heads (ie: blocks). > It's all just too inherently wrongly designed around the buffer head > (which was correct in 1995, but hasn't been correct for a long time in the > kernel any more). > > > - Poor old IO accounting broke again. > > No. That's why I used "set_page_dirty()" and did it that strange ugly way > ("set page dirty, even though it's already dirty, and even though the very > next thing we will do is TestClearPageDirty???"). nfs_set_page_dirty() and reiserfs_set_page_dirty() should now bail if PageDirty() to avoid needless work. 
> > - For a long time I've wanted to nuke the current ext3/jbd ordered-data > > implementation altogether, and just make kjournald call into the > > standard writeback code to do a standard suberblock->inodes->pages walk. > > I really would like to see less of the buffer-head-based stuff, and yes, > more of the normal inode page walking. I don't think you can "order" > accesses within a page anyway, exactly because of memory mapping issues, > so any page ordering is not about buffer heads on the page itself, it > should be purely about metadata. In this context ext3's "ordered" mode means "sync the file contents before journalling the metadata which points at it". > > - It's pretty obnoxious that the VM now sets a clean page "dirty" and > > then proceeds to modify its contents. It would be nice to stop doing > > that. > > No. I think this really the fundamental confusion people had. People > thought that setting the page dirty meant that it was no longer being > modified. No. Setting a page (or bh, or inode) dirty means "this is known to have been modified". ie: this cached entity is now out of sync with backing store. Ho hum. I don't care much, really. But then, I understand how all this stuff works. Try explaining to someone the relationship between pte-dirtiness, page-dirtiness, radix-tree-dirtiness and buffer_head-dirtiness. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 23:51 ` Andrew Morton @ 2006-12-30 0:11 ` Linus Torvalds 2006-12-30 0:33 ` Andrew Morton 0 siblings, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-30 0:11 UTC (permalink / raw) To: Andrew Morton Cc: Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, a.p.zijlstra, tbm, arjan, andrei.popa On Fri, 29 Dec 2006, Andrew Morton wrote: > > They're extra. As in "can be optimised away". Sure. Don't use buffer heads. > The buffer_head is not an IO container. It is the kernel's core > representation of a disk block. Please come back from the 90's. The buffer heads are nothing but a mapping of where the hardware block is. If you use it for anything else, you're basically screwed. > JBD implements physical block-based journalling, so it is 100% appropriate > that JBD deal with these disk blocks using their buffer_head > representation. And as long as it does that, you just have to face the fact that it's going to perform like crap, including what you call "extra" writes, and what I call "deal with it". Btw, you can make pages be physically indexed too, but they obviously (a) won't be coherent with any virtual mapping laid on top of it (b) will be _physical_, so any readahead etc will be based on physical addresses too. > I thought I fixed the performance problem? No, you papered over it, for the reasonably common case where things were physically contiguous - exactly by using a physical page cache, so now it can do read-ahead based on that. Then, because the pages contain buffer heads, the directory accesses can look up buffers, and if it was all physically contiguous, it all works fine. But if you actually want virtually indexed caching (and all _users_ want it), it really doesn't work.
> Somewhat nastily, but as ext3 directories are metadata it is appropriate > that modifications to them be done in terms of buffer_heads (ie: blocks). No. There is nothing "appropriate" about using buffer_heads for metadata. It's quite proper - and a hell of a lot more efficient - to use virtual page-caching for metadata too. Look at the ext2 readdir() implementation, and compare it to the crapola horror that is ext3. Guess what? ext2 uses virtually indexed metadata, and as a result it is both simpler, smaller and a LOT faster than ext3 in accessing that metadata. Face it, Andrew, you're wrong on this one. Really. Just take a look at ext2_readdir(). [ I'm not saying that ext2_readdir() is _beautiful_. If it had been written with the page cache in mind, it would probably have been done very differently. And it doesn't do any readahead, probably because nobody cared enough, but it should be trivial to add, and it would automatically "do the right thing" just because it's much easier at the page cache level. But I _am_ saying that compared to ext3, the ext2 readdir is a work of art. ] "metadata" has _zero_ to do with "physically indexed". There is no correlation what-so-ever. If you think there is a correlation, it's all in your mind. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-30 0:11 ` Linus Torvalds @ 2006-12-30 0:33 ` Andrew Morton 2006-12-30 0:58 ` Linus Torvalds 0 siblings, 1 reply; 311+ messages in thread From: Andrew Morton @ 2006-12-30 0:33 UTC (permalink / raw) To: Linus Torvalds Cc: Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, a.p.zijlstra, tbm, arjan, andrei.popa On Fri, 29 Dec 2006 16:11:44 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > > > > JBD implements physical block-based journalling, so it is 100% appropriate > > that JBD deal with these disk blocks using their buffer_head > > representation. > > And as long as it does that, you just have to face the fact that it's > going to perform like crap, including what you call "extra" writes, and > what I call "deal with it". It is quite tiresome to delete things which your interlocutor said and to then restate them as if it were some sort of revelation. > > Somewhat nastily, but as ext3 directories are metadata it is appropriate > > that modifications to them be done in terms of buffer_heads (ie: blocks). > > No. There is nothing "appropriate" about using buffer_heads for metadata. I said "modification". > [stuff about directory reads elided] ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-30 0:33 ` Andrew Morton @ 2006-12-30 0:58 ` Linus Torvalds 2006-12-30 1:16 ` Andrew Morton 0 siblings, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-30 0:58 UTC (permalink / raw) To: Andrew Morton Cc: Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, a.p.zijlstra, tbm, arjan, andrei.popa On Fri, 29 Dec 2006, Andrew Morton wrote: > > > > Somewhat nastily, but as ext3 directories are metadata it is appropriate > > > that modifications to them be done in terms of buffer_heads (ie: blocks). > > > > No. There is nothing "appropriate" about using buffer_heads for metadata. > > I said "modification". You said "metadata". Why do you think directories are any different from files? Yes, they are metadata. So what? What does that have to do with anything? They should still use virtual indexes, the way files do. That doesn't preclude them from using buffer-heads to mark their (partial-page) modifications and for keeping the virtual->physical translations cached. I mean, really. Look at ext2. It does exactly that. It keeps the directories in the page cache - virtually indexed. And it even modifies them there. Exactly the same way it modifies regular file data. It all works exactly the same way it works for regular files. It uses page->mapping->a_ops->prepare_write(NULL, page, from, to); ... do modification ... ext2_commit_chunk(page, from, to); exactly the way regular file data works. That's why I'm saying there is absolutely _zero_ thing about "metadata" here, or even about "modifications". It all works better in a virtual cache, because you get all the support that we give to page caches. So I really don't understand why you make excuses for ext3 and talk about "modifications" and "metadata". It was a fine design ten years ago. It's not really very good any longer. 
I suspect we're stuck with the design, but that doesn't make it any _better_. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-30 0:58 ` Linus Torvalds @ 2006-12-30 1:16 ` Andrew Morton 0 siblings, 0 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-30 1:16 UTC (permalink / raw) To: Linus Torvalds Cc: Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, Linux Kernel Mailing List, ranma, gordonfarquharson, a.p.zijlstra, tbm, arjan, andrei.popa On Fri, 29 Dec 2006 16:58:41 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > > > On Fri, 29 Dec 2006, Andrew Morton wrote: > > > > > > Somewhat nastily, but as ext3 directories are metadata it is appropriate > > > > that modifications to them be done in terms of buffer_heads (ie: blocks). > > > > > > No. There is nothing "appropriate" about using buffer_heads for metadata. > > > > I said "modification". > > You said "metadata". > > Why do you think directories are any different from files? Yes, they are > metadata. So what? What does that have to do with anything? We journal the contents of directories. Fully. So we handle their dirty data at the block (ie: buffer_head) level. When someone tries to dirty part of a directory we need to cheat and not mark that part of the page as dirty and we need to then write the block to the journal and then mark the block as really dirty for checkpointing (but still attached to the journal) and all that goop. The regular page-based writeback doesn't apply until the block has been written to the journal. At that stage the block is considered dirty against its real position on disk. It will then be written back by pdflush via the blockdev inode -> blkdev_writepage(). Unless kjournald needs to do an early flush to reclaim the journal space, in which case kjournald will write the block itself. > > So I really don't understand why you make excuses for ext3 and talk about > "modifications" and "metadata". It was a fine design ten years ago. It's > not really very good any longer. 
> As I said in another apparently-neglected email: : We could possibly move ext3/4 directories out of the blockdev pagecache and : into per-directory pagecache, but that wouldn't change anything - the : journalling would still be block-based. We already have all the code in place to journal blocks which are cached in an address_space other than the blockdev inode's: ext3_journalled_aops. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 8:58 ` Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) Linus Torvalds 2006-12-29 10:48 ` Linus Torvalds @ 2006-12-29 15:27 ` Theodore Tso 2006-12-29 17:51 ` Linus Torvalds 1 sibling, 1 reply; 311+ messages in thread From: Theodore Tso @ 2006-12-29 15:27 UTC (permalink / raw) To: Linus Torvalds Cc: Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, linux-kernel, ranma, gordonfarquharson, akpm, a.p.zijlstra, tbm, arjan, andrei.popa On Fri, Dec 29, 2006 at 12:58:12AM -0800, Linus Torvalds wrote: > Because what "__set_page_dirty_buffers()" does is that AT THE TIME THE > "set_page_dirty()" IS CALLED, it will mark all the buffers on that page as > dirty. That may _sound_ like what we want, but it really isn't. Because by > the time "writepage()" is actually called (which can be MUCH MUCH later), > some internal filesystem activity may actually have cleaned one or more of > those buffers in the meantime, and now we call "writepage()" (which really > wants to write them _all_), and it will write only part of them, or none > at all. I'm confused. Does this mean that if "fs blocksize"=="VM pagesize" this bug can't trigger? But I thought at least one of the people reporting corruption was using a filesystem with a 4k block size on an i386? - Ted ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) 2006-12-29 15:27 ` Theodore Tso @ 2006-12-29 17:51 ` Linus Torvalds 0 siblings, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-29 17:51 UTC (permalink / raw) To: Theodore Tso Cc: Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, linux-kernel, ranma, gordonfarquharson, akpm, a.p.zijlstra, tbm, arjan, andrei.popa On Fri, 29 Dec 2006, Theodore Tso wrote: > > I'm confused. Does this mean that if "fs blocksize"=="VM pagesize" > this bug can't trigger? No. Even if there is just a single buffer-head, if the filesystem ever writes out that _single_ buffer-head out of turn (ie before the VM actually asks it to, with "->writepage()"), then the same issue will happen. In fact, a bigger fs blocksize will likely just make this easier to trigger (although I doubt it makes a big difference), since any out-of-order buffer flushback will happen for the whole page, rather than just a part of the page. So the "problem" really ends up being that the filesystem does flushing that the VM isn't aware of, so when the VM did "set_page_dirty()" at an earlier time, the VM _expected_ the "->writepage()" call that happened much later to write the whole page - but because the FS had flushed things behind its back even _before_ the "->writepage" happens, by the time the VM actually asks for the page to be written out, the FS layer won't actually write it all out any more. Blocksize doesn't matter, the only thing that matters is whether something writes out data on a buffer-cache level, not on a "page cache" level. Ext3 apparently does this in "ordered" data mode at least (and hey, I suspect that the code that tries to release buffer head data might try to do it on its own too). Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
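The sequence described in the message above can be modelled in a few lines of userspace C. This is only a sketch under loud assumptions: struct model_page and every model_* function are hypothetical names invented for illustration, not kernel APIs, and the model ignores ptes, locking and real IO. It shows one thing only: if per-buffer dirty bits are set when set_page_dirty() is called, and the filesystem cleans a buffer before writepage() runs, a later store into that buffer never reaches disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NBUF 4 /* buffers per page, e.g. 1k blocks on a 4k page (assumption) */

/* Hypothetical stand-in for a page with attached buffer_heads. */
struct model_page {
	bool page_dirty;
	bool buf_dirty[NBUF];
	char mem[NBUF];  /* page cache contents, one byte per block */
	char disk[NBUF]; /* what has actually reached the disk */
};

/* Mimics the behaviour described for __set_page_dirty_buffers():
 * every buffer is marked dirty at the time of the call. */
static void model_set_page_dirty(struct model_page *p)
{
	p->page_dirty = true;
	for (int i = 0; i < NBUF; i++)
		p->buf_dirty[i] = true;
}

/* The filesystem writes one buffer on its own, without the VM knowing. */
static void model_fs_flush_buffer(struct model_page *p, int i)
{
	assert(i >= 0 && i < NBUF);
	if (p->buf_dirty[i]) {
		p->disk[i] = p->mem[i];
		p->buf_dirty[i] = false;
	}
}

/* A store through a mapping of a page the VM already considers dirty:
 * memory changes, but no buffer is re-dirtied. */
static void model_mmap_store(struct model_page *p, int i, char v)
{
	assert(i >= 0 && i < NBUF);
	p->mem[i] = v;
}

/* writepage(): writes only the buffers still marked dirty, then the
 * page as a whole is considered clean. */
static void model_writepage(struct model_page *p)
{
	for (int i = 0; i < NBUF; i++)
		model_fs_flush_buffer(p, i);
	p->page_dirty = false;
}

/* Returns how many blocks differ between memory and disk afterwards. */
int model_stale_blocks(bool fs_flushes_early)
{
	struct model_page p;
	int stale = 0;

	memset(&p, 0, sizeof(p));
	memset(p.mem, 'A', NBUF);
	model_set_page_dirty(&p);
	if (fs_flushes_early)
		model_fs_flush_buffer(&p, 2); /* out-of-turn buffer write */
	model_mmap_store(&p, 2, 'B');         /* data changes after that */
	model_writepage(&p);

	for (int i = 0; i < NBUF; i++)
		if (p.mem[i] != p.disk[i])
			stale++;
	return stale;
}
```

Calling model_stale_blocks(true) leaves one block stale on the model's "disk" while the page is marked clean; model_stale_blocks(false) leaves none. Block size plays no role in the outcome, which matches the point made in the reply: what matters is that something wrote at buffer level before the page-level writeback.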
* [patch] fix data corruption bug in __block_write_full_page() 2006-12-29 6:48 ` Linus Torvalds 2006-12-29 8:58 ` Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one) Linus Torvalds @ 2006-12-29 12:19 ` Ingo Molnar 2007-01-02 11:20 ` Christoph Hellwig 1 sibling, 1 reply; 311+ messages in thread From: Ingo Molnar @ 2006-12-29 12:19 UTC (permalink / raw) To: Linus Torvalds Cc: Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, linux-kernel, ranma, gordonfarquharson, akpm, a.p.zijlstra, tbm, arjan, andrei.popa * Linus Torvalds <torvalds@osdl.org> wrote: > I do have a few interesting details from the trace I haven't really > analyzed yet. Here's the trace for events on one of the pages that was > corrupted. Note how the events are numbered (there were 171640 events > total), so the thing you see is just a small set of events from the > whole big trace, but it's the ones that talk about _that_ particular > page. i've extended the tracer in -rt to trace all relevant pagetable, pagecache, buffer-cache and IO events and coupled the tracer to your test.c code. The corruption happens here: test-2126 0.... 3756170us+: trace_page (cf20ebd8 b6a2c000 0) pdflush-2006 0.... 6432909us+: trace_page (cf20ebd8 b6a2c000 4200420) test-2126 0.... 8135596us+: trace_page (cf20ebd8 b6a2c000 4200420) test-2126 0D... 9012933us+: do_page_fault (8048900 4 b6a2c000) test-2126 0.... 9023278us+: trace_page (cf262f24 b6a2c000 0) test-2126 0.... 9023305us > sys_prctl (000000d8 b6a2c000 000000ac) address 0xb6a2c000 is the one that shows the corruption. Now, this address is mapped to page cf262f24 when the bug happened, but it had page 0xcf20ebd8 mapped to it 3 seconds ago, which has this history: test-2126 0.... 3756413us+: trace_page (cf20ebd8 0 0) test-2126 0.... 3756469us+: trace_page (cf20ebd8 0 0) test-2126 0.... 3757341us+: trace_page (cf20ebd8 10 0) IRQ-14-402 0.... 3759332us+: trace_page (cf20ebd8 ffffffff 0) IRQ-14-402 0.... 
3759376us+: trace_page (cf20ebd8 ffffffff 0) test-2126 0.... 5104662us+: trace_page (cf20ebd8 b6a2c400 0) test-2126 0.... 5104687us+: trace_page (cf20ebd8 1 0) pdflush-2006 0.... 6432909us+: trace_page (cf20ebd8 b6a2c000 4200420) pdflush-2006 0.... 6432952us+: trace_page (cf20ebd8 ffffffff 4200420) pdflush-2006 0.... 6432986us+: trace_page (cf20ebd8 1 4200420) pdflush-2006 0.... 6433022us+: trace_page (cf20ebd8 4096 4200420) pdflush-2006 0.... 6433061us+: trace_page (cf20ebd8 0 4200420) pdflush-2006 0.... 6433112us+: trace_page (cf20ebd8 0 4200420) pdflush-2006 0.... 6433154us+: trace_page (cf20ebd8 0 4200420) pdflush-2006 0.... 6433303us+: trace_page (cf20ebd8 11 4200420) pdflush-2006 0.... 6433343us+: trace_page (cf20ebd8 13 4200420) pdflush-2006 0.... 6433382us+: trace_page (cf20ebd8 14 4200420) pdflush-2006 0.... 6433421us+: trace_page (cf20ebd8 15 4200420) pdflush-2006 0.... 6433460us+: trace_page (cf20ebd8 ffffffff 4200420) pdflush-2006 0.... 6433504us+: trace_page (cf20ebd8 ffffffff 4200420) test-2126 0.... 8135596us+: trace_page (cf20ebd8 b6a2c000 4200420) in particular timestamp 6433421us is interesting: pdflush-2006 0.... 6433504us+: trace_page (cf20ebd8 ffffffff 4200420) pdflush-2006 0.... 6433526us : trace_page()<-test_clear_page_writeback()<-end_page_writeback()<-__block_write_full_page() pdflush-2006 0.... 6433526us+: block_write_full_page()<-ext3_ordered_writepage()<-generic_writepages()<-(-1)() i.e. the page got its pending writeback cancelled in block_write_full_page(), without any IRQ#14 activity whatsoever! That looks quite suspect. It is this piece of code in __block_write_full_page(): /* * The page was marked dirty, but the buffers were * clean. Someone wrote them back by hand with * ll_rw_block/submit_bh. A rare case. */ .... if (uptodate) SetPageUptodate(page); end_page_writeback(page); A 'rare case' ... hm. 
So i tried a quick workaround below, just to keep us from marking the page clean, to see whether the corruption goes away - and i was unable to trigger the corruption after half an hour of testing, while before it triggered within 10 seconds! now this patch is only an ugly hack, but the bug definitely seems to be related to buffer management, as you suspected. Ingo --- fs/buffer.c | 1 + 1 file changed, 1 insertion(+) Index: linux/fs/buffer.c =================================================================== --- linux.orig/fs/buffer.c +++ linux/fs/buffer.c @@ -1702,6 +1702,7 @@ done: } while (bh != head); if (uptodate) SetPageUptodate(page); + set_page_dirty(page); end_page_writeback(page); /* * The page and buffer_heads can be released at any time from ^ permalink raw reply [flat|nested] 311+ messages in thread
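Why re-dirtying the page hides the corruption can be sketched in userspace C as well. Everything here is an assumption made for illustration: struct wb_page and the wb_* functions are hypothetical, the redirty_if_clean flag stands in for the one-line set_page_dirty() hack above, and the model assumes that dirtying the page dirties all of its buffers again. It is not a description of what the final fix should be.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NBUF 4 /* buffers per page (assumption) */

/* Hypothetical page with per-buffer dirty state. */
struct wb_page {
	bool dirty;
	bool buf_dirty[NBUF];
	char mem[NBUF];  /* in-memory contents */
	char disk[NBUF]; /* on-disk contents */
};

static void wb_set_page_dirty(struct wb_page *p)
{
	assert(p != NULL);
	p->dirty = true;
	for (int i = 0; i < NBUF; i++)
		p->buf_dirty[i] = true;
}

/* One writeback pass; returns the number of buffers submitted.
 * With redirty_if_clean set, a page whose buffers were all clean is
 * marked dirty again instead of being silently treated as written. */
static int wb_writepage(struct wb_page *p, bool redirty_if_clean)
{
	int submitted = 0;

	p->dirty = false; /* the page dirty bit is cleared for IO */
	for (int i = 0; i < NBUF; i++) {
		if (p->buf_dirty[i]) {
			p->disk[i] = p->mem[i];
			p->buf_dirty[i] = false;
			submitted++;
		}
	}
	if (submitted == 0 && redirty_if_clean)
		wb_set_page_dirty(p); /* the "rare case" path */
	return submitted;
}

/* Returns the number of blocks where disk and memory still differ. */
int wb_stale_after_writeback(bool redirty_if_clean)
{
	struct wb_page p;
	int stale = 0;

	memset(&p, 0, sizeof(p));
	memset(p.mem, 'A', NBUF);
	wb_set_page_dirty(&p);

	/* Someone writes every buffer "by hand" before writepage runs. */
	for (int i = 0; i < NBUF; i++) {
		p.disk[i] = p.mem[i];
		p.buf_dirty[i] = false;
	}
	p.mem[0] = 'B'; /* later store into the still-dirty page */

	wb_writepage(&p, redirty_if_clean); /* dirty page, clean buffers */
	if (p.dirty)
		wb_writepage(&p, redirty_if_clean); /* second pass if re-dirtied */

	for (int i = 0; i < NBUF; i++)
		if (p.mem[i] != p.disk[i])
			stale++;
	return stale;
}
```

In this model wb_stale_after_writeback(false) ends with one stale block, and wb_stale_after_writeback(true) ends with none because the second pass writes everything. That is consistent with the observation that the hack makes the corruption go away, while still being a workaround rather than an explanation.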
* Re: [patch] fix data corruption bug in __block_write_full_page() 2006-12-29 12:19 ` [patch] fix data corruption bug in __block_write_full_page() Ingo Molnar @ 2007-01-02 11:20 ` Christoph Hellwig 2007-01-02 12:06 ` Ingo Molnar 0 siblings, 1 reply; 311+ messages in thread From: Christoph Hellwig @ 2007-01-02 11:20 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Segher Boessenkool, David Miller, nickpiggin, kenneth.w.chen, guichaz, hugh, linux-kernel, ranma, gordonfarquharson, akpm, a.p.zijlstra, tbm, arjan, andrei.popa On Fri, Dec 29, 2006 at 01:19:46PM +0100, Ingo Molnar wrote: > i've extended the tracer in -rt to trace all relevant pagetable, > pagecache, buffer-cache and IO events and coupled the tracer to your > test.c code. The corruption happens here: > > test-2126 0.... 3756170us+: trace_page (cf20ebd8 b6a2c000 0) > pdflush-2006 0.... 6432909us+: trace_page (cf20ebd8 b6a2c000 4200420) > test-2126 0.... 8135596us+: trace_page (cf20ebd8 b6a2c000 4200420) > test-2126 0D... 9012933us+: do_page_fault (8048900 4 b6a2c000) > test-2126 0.... 9023278us+: trace_page (cf262f24 b6a2c000 0) > test-2126 0.... 9023305us > sys_prctl (000000d8 b6a2c000 000000ac) This tracer definitely looks interesting. Could you send a splitout patch with it to lkml for review? ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [patch] fix data corruption bug in __block_write_full_page() 2007-01-02 11:20 ` Christoph Hellwig @ 2007-01-02 12:06 ` Ingo Molnar 2007-01-02 12:16 ` Christoph Hellwig 0 siblings, 1 reply; 311+ messages in thread From: Ingo Molnar @ 2007-01-02 12:06 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-kernel, Linus Torvalds, Andrew Morton [Cc:-ed lkml] * Christoph Hellwig <hch@infradead.org> wrote: > On Fri, Dec 29, 2006 at 01:19:46PM +0100, Ingo Molnar wrote: > > i've extended the tracer in -rt to trace all relevant pagetable, > > pagecache, buffer-cache and IO events and coupled the tracer to your > > test.c code. The corruption happens here: > > > > test-2126 0.... 3756170us+: trace_page (cf20ebd8 b6a2c000 0) > > pdflush-2006 0.... 6432909us+: trace_page (cf20ebd8 b6a2c000 4200420) > > test-2126 0.... 8135596us+: trace_page (cf20ebd8 b6a2c000 4200420) > > test-2126 0D... 9012933us+: do_page_fault (8048900 4 b6a2c000) > > test-2126 0.... 9023278us+: trace_page (cf262f24 b6a2c000 0) > > test-2126 0.... 9023305us > sys_prctl (000000d8 b6a2c000 000000ac) > > This tracer definitly looks interesting. Could you send a splitout > patch with it to lkml for review? Find it below - it's ontop of the tracer included in 2.6.20-rc2-rt3. it's very ad-hoc, based on Linus' test utility. I can write such a tracer in 30 minutes so i usually throw them away. I literally wrote dozens of tracer variants for specific bugs in the past few years. Note: this particular one tracks page contents as well from kernel-space, that's how i was able to see where the corruption happened. That assumes that there's no highmem on the box. Also, the pte value tracking portion is only for i386 - etc. etc. Note: for the bug to be visible i didnt need the per-page tracking portion of the tracer - the key was to track page contents, and to track how virtual addresses map to physical pages, and how their IO happens. 
This patch is /not/ for merging: this patch too underscores my years-long experience that static tracepoints included in the generic kernel are just pointless in general - i don't want to see such cruft in the kernel, and they amass with time. The union of all ad-hoc tracing hacks i had in the past would be thousands of static tracepoints - and that's just /me/. If we pick only a handful they won't help us find the most difficult bugs and they'll only create additional 'demand' for 'more' - leading to an endless fight. The best method i think is to use the source code itself (Linus used printks) - or if any infrastructure is to be used then ad-hoc "scriptlets" via SystemTap can find the really difficult bugs - and in the long run systemtap suits that purpose best. If systemtap were ubiquitous we could have sent scriptlets to users who experienced the bugs, for them to install them dynamically. Systemtap makes it plain obvious that tracepoints 1) are detached from the source code and 2) are temporary and ad-hoc in nature. It doesn't create undue pressure to include more and more static tracepoints. 
Ingo -----------> fs/buffer.c | 1 include/asm-i386/pgtable-2level.h | 4 + include/linux/mm_types.h | 22 +++++++++ kernel/sys.c | 15 ++++++ mm/Makefile | 2 mm/memory.c | 2 mm/page-writeback.c | 33 ++++++++++++-- mm/page_alloc.c | 3 + mm/page_trace.c | 84 ++++++++++++++++++++++++++++++++++++++ mm/rmap.c | 2 10 files changed, 159 insertions(+), 9 deletions(-) Index: linux/fs/buffer.c =================================================================== --- linux.orig/fs/buffer.c +++ linux/fs/buffer.c @@ -1590,6 +1590,7 @@ static int __block_write_full_page(struc int nr_underway = 0; BUG_ON(!PageLocked(page)); + trace_page(page, blocksize); last_block = (i_size_read(inode) - 1) >> inode->i_blkbits; Index: linux/include/asm-i386/pgtable-2level.h =================================================================== --- linux.orig/include/asm-i386/pgtable-2level.h +++ linux/include/asm-i386/pgtable-2level.h @@ -13,7 +13,9 @@ */ #ifndef CONFIG_PARAVIRT #define set_pte(pteptr, pteval) (*(pteptr) = pteval) -#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval) +struct mm_struct; +extern void trace_set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte_val); +#define set_pte_at(mm,addr,ptep,pteval) trace_set_pte_at(mm,addr,ptep,pteval) #define set_pmd(pmdptr, pmdval) (*(pmdptr) = (pmdval)) #endif Index: linux/include/linux/mm_types.h =================================================================== --- linux.orig/include/linux/mm_types.h +++ linux/include/linux/mm_types.h @@ -5,9 +5,29 @@ #include <linux/threads.h> #include <linux/list.h> #include <linux/spinlock.h> +#include <linux/stacktrace.h> struct address_space; +struct page; +struct seq_file; + +#define PAGE_TRACE_DEPTH 16 +#define PAGE_TRACE_NR 20 + +struct page_trace_entry { + unsigned long timestamp; + char comm[17]; + int pid; + int nr_entries; + unsigned long info; + unsigned long content; + unsigned long entries[PAGE_TRACE_DEPTH]; +}; + +extern void trace_page(struct page *page, unsigned 
long info); +extern void print_page_trace(struct seq_file *m, struct page *page); + /* * Each physical page in the system has a struct page associated with * it to keep track of whatever it is we are using the page for at the @@ -62,6 +82,8 @@ struct page { void *virtual; /* Kernel virtual address (NULL if not kmapped, ie. highmem) */ #endif /* WANT_PAGE_VIRTUAL */ + int trace_idx; + struct page_trace_entry trace[PAGE_TRACE_NR]; }; #endif /* _LINUX_MM_TYPES_H */ Index: linux/kernel/sys.c =================================================================== --- linux.orig/kernel/sys.c +++ linux/kernel/sys.c @@ -2067,6 +2067,21 @@ asmlinkage long sys_prctl(int option, un { long error; + if (option == 999) { + unsigned long addr = arg2; + struct vm_area_struct *vma = find_vma(current->mm, addr); + struct page *page = NULL; + + printk("page trace, got addr %08lx, vma %p\n", addr, vma); + if (vma) { + page = follow_page(vma, addr, FOLL_GET); + if (page) { + print_page_trace(NULL, page); + put_page(page); + } + } + return 0; + } #ifdef CONFIG_EVENT_TRACE if (option == PR_SET_TRACING) { if (arg2) Index: linux/mm/Makefile =================================================================== --- linux.orig/mm/Makefile +++ linux/mm/Makefile @@ -9,7 +9,7 @@ mmu-$(CONFIG_MMU) := fremap.o highmem.o obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \ page_alloc.o page-writeback.o pdflush.o \ - readahead.o swap.o truncate.o vmscan.o \ + readahead.o swap.o truncate.o vmscan.o page_trace.o \ prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \ $(mmu-y) Index: linux/mm/memory.c =================================================================== --- linux.orig/mm/memory.c +++ linux/mm/memory.c @@ -451,6 +451,8 @@ struct page *vm_normal_page(struct vm_ar * The PAGE_ZERO() pages and various VDSO mappings can * cause them to exist. 
*/ + + trace_page(pfn_to_page(pfn), addr); return pfn_to_page(pfn); } Index: linux/mm/page-writeback.c =================================================================== --- linux.orig/mm/page-writeback.c +++ linux/mm/page-writeback.c @@ -762,8 +762,10 @@ int __set_page_dirty_nobuffers(struct pa struct address_space *mapping = page_mapping(page); struct address_space *mapping2; - if (!mapping) + if (!mapping) { + trace_page(page, 1); return 1; + } write_lock_irq(&mapping->tree_lock); mapping2 = page_mapping(page); @@ -781,8 +783,10 @@ int __set_page_dirty_nobuffers(struct pa /* !PageAnon && !swapper_space */ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); } + trace_page(page, 1); return 1; } + trace_page(page, 0); return 0; } EXPORT_SYMBOL(__set_page_dirty_nobuffers); @@ -806,6 +810,7 @@ EXPORT_SYMBOL(redirty_page_for_writepage int fastcall set_page_dirty(struct page *page) { struct address_space *mapping = page_mapping(page); + int ret; if (likely(mapping)) { int (*spd)(struct page *) = mapping->a_ops->set_page_dirty; @@ -813,12 +818,17 @@ int fastcall set_page_dirty(struct page if (!spd) spd = __set_page_dirty_buffers; #endif - return (*spd)(page); + ret = (*spd)(page); + trace_page(page, ret); + return ret; } if (!PageDirty(page)) { - if (!TestSetPageDirty(page)) + if (!TestSetPageDirty(page)) { + trace_page(page, 1); return 1; + } } + trace_page(page, 0); return 0; } EXPORT_SYMBOL(set_page_dirty); @@ -840,6 +850,7 @@ int set_page_dirty_lock(struct page *pag lock_page_nosync(page); ret = set_page_dirty(page); unlock_page(page); + trace_page(page, ret); return ret; } EXPORT_SYMBOL(set_page_dirty_lock); @@ -915,13 +926,17 @@ int test_clear_page_writeback(struct pag write_lock_irqsave(&mapping->tree_lock, flags); ret = TestClearPageWriteback(page); - if (ret) + trace_page(page, ret); + if (ret) { radix_tree_tag_clear(&mapping->page_tree, page_index(page), PAGECACHE_TAG_WRITEBACK); + trace_page(page, ret); + } write_unlock_irqrestore(&mapping->tree_lock, flags); 
} else { ret = TestClearPageWriteback(page); + trace_page(page, ret); } return ret; } @@ -936,17 +951,23 @@ int test_set_page_writeback(struct page write_lock_irqsave(&mapping->tree_lock, flags); ret = TestSetPageWriteback(page); - if (!ret) + trace_page(page, ret); + if (!ret) { radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_WRITEBACK); - if (!PageDirty(page)) + trace_page(page, ret); + } + if (!PageDirty(page)) { radix_tree_tag_clear(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); + trace_page(page, ret); + } write_unlock_irqrestore(&mapping->tree_lock, flags); } else { ret = TestSetPageWriteback(page); + trace_page(page, ret); } return ret; Index: linux/mm/page_alloc.c =================================================================== --- linux.orig/mm/page_alloc.c +++ linux/mm/page_alloc.c @@ -1420,6 +1420,8 @@ nopage: show_mem(); } got_pg: + if (page) + trace_page(page, order); return page; } @@ -1468,6 +1470,7 @@ void __pagevec_free(struct pagevec *pvec fastcall void __free_pages(struct page *page, unsigned int order) { if (put_page_testzero(page)) { + trace_page(page, order); if (order == 0) free_hot_page(page); else Index: linux/mm/page_trace.c =================================================================== --- /dev/null +++ linux/mm/page_trace.c @@ -0,0 +1,84 @@ + +#include <linux/seq_file.h> +#include <linux/mm.h> +#include <linux/sched.h> + +void trace_page(struct page *page, unsigned long info) +{ + struct page_trace_entry *entry; + struct stack_trace trace; + unsigned long flags, content; + unsigned long *addr; + + addr = (unsigned long *)page_address(page); + if (addr) + content = *addr; + else + content = 0x12344321; + + trace_special((unsigned long)page, info, content); + trace_special_sym(); + + local_irq_save(flags); + page->trace_idx = (page->trace_idx + 1) % PAGE_TRACE_NR; + entry = page->trace + page->trace_idx; + trace.nr_entries = 0; + trace.max_entries = PAGE_TRACE_DEPTH; + trace.entries = 
entry->entries; + trace.skip = 3; + trace.all_contexts = 0; + save_stack_trace(&trace, NULL); + entry->nr_entries = trace.nr_entries; + entry->timestamp = jiffies - INITIAL_JIFFIES; + entry->pid = current->pid; + entry->info = info; + entry->content = content; + memcpy(entry->comm, current->comm, TASK_COMM_LEN); + local_irq_restore(flags); +} + +static void print_page_trace_entry(struct seq_file *m, + struct page_trace_entry *entry, int idx) +{ + struct stack_trace trace; + SEQ_printf(m, "#%02d, %06ld.%03ld, %-16s:%d, (#%d): content: %08lx, info: %08lx\n", + idx, entry->timestamp / HZ, entry->timestamp % HZ, entry->comm, entry->pid, + entry->nr_entries, entry->content, entry->info); + + trace.nr_entries = entry->nr_entries; + trace.entries = entry->entries; + print_stack_trace(&trace, 2); + SEQ_printf(m, "\n"); +} + +void print_page_trace(struct seq_file *m, struct page *page) +{ + int i, i0; + + SEQ_printf(m, "printing page %p's events:\n", page); + + i0 = i = page->trace_idx; + do { + i = (i + 1) % PAGE_TRACE_NR; + print_page_trace_entry(m, page->trace + i, i); + } while (i != i0); +} + +static void trace_pte(pte_t pte, unsigned long addr) +{ + unsigned long pfn; + + if (pte_present(pte)) { + pfn = pte_pfn(pte); + if (pfn_valid(pfn)) + trace_page(pfn_to_page(pfn), addr); + } +} + +void trace_set_pte_at(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pteval) +{ + trace_pte(*ptep, addr); + set_pte(ptep, pteval); + trace_pte(pteval, addr); +} Index: linux/mm/rmap.c =================================================================== --- linux.orig/mm/rmap.c +++ linux/mm/rmap.c @@ -452,7 +452,7 @@ static int page_mkclean_one(struct page entry = ptep_clear_flush(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); - set_pte_at(vma, address, pte, entry); + set_pte_at(mm, address, pte, entry); lazy_mmu_prot_update(entry); ret = 1; } ^ permalink raw reply [flat|nested] 311+ messages in thread
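The per-page history kept by trace_page() and walked by print_page_trace() in the patch above is a fixed-size ring indexed by trace_idx. A minimal userspace sketch of that indexing follows. The names trace_ring, ring_record and ring_snapshot are hypothetical, and the "used" flag is an addition of this sketch (the patch itself prints all PAGE_TRACE_NR slots, including ones never written).

```c
#include <assert.h>
#include <stddef.h>

#define RING_NR 20 /* mirrors PAGE_TRACE_NR in the patch */

struct ring_entry {
	unsigned long info;
	int used; /* sketch-only: 0 until the slot is first written */
};

struct trace_ring {
	int idx; /* slot written most recently */
	struct ring_entry slot[RING_NR];
};

/* Advance the index first, then overwrite that slot, as trace_page()
 * does with page->trace_idx. Old entries are lost once the ring wraps. */
static void ring_record(struct trace_ring *r, unsigned long info)
{
	assert(r != NULL);
	r->idx = (r->idx + 1) % RING_NR;
	r->slot[r->idx].info = info;
	r->slot[r->idx].used = 1;
}

/* Copy the recorded entries into out[] (capacity RING_NR), oldest
 * first: start one past the newest slot and walk once around the ring,
 * the same order print_page_trace() uses. Returns the entry count. */
static size_t ring_snapshot(const struct trace_ring *r, unsigned long *out)
{
	size_t n = 0;
	int i = r->idx;

	assert(r != NULL && out != NULL);
	do {
		i = (i + 1) % RING_NR;
		if (r->slot[i].used)
			out[n++] = r->slot[i].info;
	} while (i != r->idx);
	return n;
}
```

With a zero-initialised ring, recording 25 events leaves only the last 20 visible, so a page with a busy history shows just its most recent events. That fits the ad-hoc nature of the tracer: the depth (PAGE_TRACE_NR) is simply picked large enough for the bug being chased.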
* Re: [patch] fix data corruption bug in __block_write_full_page() 2007-01-02 12:06 ` Ingo Molnar @ 2007-01-02 12:16 ` Christoph Hellwig 0 siblings, 0 replies; 311+ messages in thread From: Christoph Hellwig @ 2007-01-02 12:16 UTC (permalink / raw) To: Ingo Molnar Cc: Christoph Hellwig, linux-kernel, Linus Torvalds, Andrew Morton On Tue, Jan 02, 2007 at 01:06:34PM +0100, Ingo Molnar wrote: > Find it below - it's on top of the tracer included in 2.6.20-rc2-rt3. > it's very ad-hoc, based on Linus' test utility. I can write such a > tracer in 30 minutes so i usually throw them away. I literally wrote > dozens of tracer variants for specific bugs in the past few years. Ah, I thought this was a general purpose tracer. Question solved, thanks :) I was just tired of writing my own special purpose tracers all the time as well. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one 2006-12-28 19:45 ` Andrew Morton 2006-12-28 20:14 ` Linus Torvalds @ 2006-12-28 22:35 ` Mike Galbraith 1 sibling, 0 replies; 311+ messages in thread From: Mike Galbraith @ 2006-12-28 22:35 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Guillaume Chazarain, David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra, andrei.popa, hugh, nickpiggin, arjan, Linux Kernel Mailing List, Chen Kenneth W On Thu, 2006-12-28 at 11:45 -0800, Andrew Morton wrote: > On Thu, 28 Dec 2006 11:28:52 -0800 (PST) > Linus Torvalds <torvalds@osdl.org> wrote: > > > > > > > On Thu, 28 Dec 2006, Guillaume Chazarain wrote: > > > > > > The attached patch fixes the corruption for me. > > > > Well, that's a good hint, but it's really just a symptom. You effectively > > just made the test-program not even try to flush the data to disk, so the > > page cache would stay in memory, and you'd not see the corruption as well. > > > > So you basically disabled the code that tried to trigger the bug more > > easily. > > > > But the reason I say it's interesting is that "WB_SYNC_NONE" is very much > > implicated in mm/page-writeback.c, and if there is a bug triggered by > > WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also > > fails.. > > > > It would be interesting to convert your app to do fsync() before > FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well > (apart from pdflush activity). I did fdatasync(), tried remapping before unmapping... nogo here. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 12:32 ` Martin Michlmayr 2006-12-22 12:59 ` Martin Michlmayr @ 2006-12-22 15:01 ` Patrick Mau 2006-12-23 8:15 ` Andrei Popa 2 siblings, 0 replies; 311+ messages in thread From: Patrick Mau @ 2006-12-22 15:01 UTC (permalink / raw) To: Linux Kernel On Fri, Dec 22, 2006 at 01:32:49PM +0100, Martin Michlmayr wrote: > * Andrei Popa <andrei.popa@i-neo.ro> [2006-12-22 14:24]: > > With all three patches I have corruption.... > > I've completed one installation with Linus' patch plus the two from > Andrew successfully, but I'm currently trying again... but I really > need a better testcase since an installation takes about an hour. > Andrei, which torrent do you download as a testcase? It would be good > if someone could suggest a torrent which is legal and not too large. Hi everyone, I have been reading this thread for the last few days, but have been silent. I have 3 torrents here for testing, if you want. You can easily reproduce with "rtorrent", if you: - Have a completely downloaded one, no matter what size - Corrupt the download with dd if=/dev/zero of=download.file bs=16k count=1 - Restart 'rtorrent', hash-check fails - It will download 1 piece that was corrupted. The important part here is that rtorrent transfers one piece, using its own code sequence to write to the file. Let me offer to test until Saturday afternoon CET, I have a cloned git repository, downloaded torrent files and "apt". My systems that are affected are: Linux oscar 2.6.18 SMP (2x450MHz Intel P3) (rolled back to 2.6.18 but can boot latest git) Linux tony 2.6.20-git UP (can be tested using all kinds of "apt" operations) Both machines are using: IDE -> MD-RAID1 -> LVM -> EXT3 (data=ordered) SCSI -> MD-RAID5 -> ..... I don't want to disturb your technical discussion, just offering some help in testing. Regards, Patrick ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 12:32 ` Martin Michlmayr 2006-12-22 12:59 ` Martin Michlmayr 2006-12-22 15:01 ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Patrick Mau @ 2006-12-23 8:15 ` Andrei Popa 2 siblings, 0 replies; 311+ messages in thread From: Andrei Popa @ 2006-12-23 8:15 UTC (permalink / raw) To: Martin Michlmayr Cc: Andrew Morton, Linus Torvalds, Gordon Farquharson, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Linux Kernel Mailing List On Fri, 2006-12-22 at 13:32 +0100, Martin Michlmayr wrote: > * Andrei Popa <andrei.popa@i-neo.ro> [2006-12-22 14:24]: > > With all three patches I have corruption.... > > I've completed one installation with Linus' patch plus the two from > Andrew successfully, but I'm currently trying again... but I really > need a better testcase since an installation takes about an hour. > Andrei, which torrent do you download as a testcase? It would be good > if someone could suggest a torrent which is legal and not too large. It's a 1.4GB file torrent split in 84 rar files and there are many seeders. I download with ~ 5MB/sec. The torrent is private. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 4:54 ` Linus Torvalds 2006-12-22 10:00 ` Martin Michlmayr @ 2006-12-22 15:08 ` Gordon Farquharson 1 sibling, 0 replies; 311+ messages in thread From: Gordon Farquharson @ 2006-12-22 15:08 UTC (permalink / raw) To: Linus Torvalds Cc: Andrew Morton, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann On 12/21/06, Linus Torvalds <torvalds@osdl.org> wrote: > Andrew located at least one bug: we run cancel_dirty_page() too late in > "truncate_complete_page()", which means that do_invalidatepage() ends up > not clearing the page cache. > > His patch is appended. Thanks. I'll try this out later today. > But it sounds like I probably misunderstood something, because I thought > that Martin had acknowledged that this patch actually worked for him. > Which sounded very similar to your setup (he has a 32M ARM box too, no?) Yup, we have the same machines (Linksys NSLU2) and are running the same test case (installing Debian). However, I'm not sure what kernel version he had used for his latest test. I presumed 2.6.20-git, whereas I had used 2.6.19. > Maybe it's mount option issue? I've got data=ordered on my machine, are > you perhaps running with something else? We are also using ordered. /dev/scsi/host0/bus0/target0/lun0/part1 /target ext3 rw,data=ordered 0 0 Gordon -- Gordon Farquharson ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 4:20 ` Gordon Farquharson 2006-12-22 4:54 ` Linus Torvalds @ 2006-12-22 10:01 ` Martin Michlmayr 2006-12-22 15:16 ` Gordon Farquharson 1 sibling, 1 reply; 311+ messages in thread From: Martin Michlmayr @ 2006-12-22 10:01 UTC (permalink / raw) To: Gordon Farquharson Cc: Andrew Morton, Linus Torvalds, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann * Gordon Farquharson <gordonfarquharson@gmail.com> [2006-12-21 21:20]: > generating these files, pkgcache.bin grows to 12582912 bytes, and when > apt-get finishes, pkgcache.bin is 6425533 bytes and srcpkgcache.bin is > 64254483 bytes. This time, when apt-get exited, it had only created > pkgcache.bin which was still 12582912 bytes. Yes, same here: sh-3.1# ls -l /var/cache/apt/ total 5252 drwxr-xr-x 3 root root 12288 Dec 22 04:41 archives -rw-r--r-- 1 root root 12582912 Dec 22 04:45 pkgcache.bin -rw-r--r-- 1 root root 8554 Dec 22 04:45 srcpkgcache.bin Gordon, does it fail for you where it normally does (installing initramfs-tools) or much later? For me, the installer was able to install initramfs-tools and the kernel, but apt now hangs at "Select and install software". -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-22 10:01 ` Martin Michlmayr @ 2006-12-22 15:16 ` Gordon Farquharson 0 siblings, 0 replies; 311+ messages in thread From: Gordon Farquharson @ 2006-12-22 15:16 UTC (permalink / raw) To: Martin Michlmayr Cc: Andrew Morton, Linus Torvalds, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann On 12/22/06, Martin Michlmayr <tbm@cyrius.com> wrote: > sh-3.1# ls -l /var/cache/apt/ > total 5252 > drwxr-xr-x 3 root root 12288 Dec 22 04:41 archives > -rw-r--r-- 1 root root 12582912 Dec 22 04:45 pkgcache.bin > -rw-r--r-- 1 root root 8554 Dec 22 04:45 srcpkgcache.bin This listing is a little different to what I got. For me, srcpkgcache.bin did not exist when apt eventually finished. Did you notice whether the install took a lot longer than usual ? > Gordon, does it fail for you where it normally does (installing > initramfs-tools) or much later? For me, the installer was able to > install initramfs-tools and the kernel, but apt now hangs at "Select > and install software". apt didn't hang for me, it just took 20 to 30 minutes to complete building the package database. Usually, it takes less than a minute. The installer stopped because it could not find a kernel to install. I have seen this failure mode before, and as you have previously pointed out, is probably the same problem (corrupted apt cache files), just a different manifestation. Gordon -- Gordon Farquharson ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 7:53 ` Linus Torvalds 2006-12-21 8:38 ` Martin Michlmayr 2006-12-21 9:17 ` Gordon Farquharson @ 2006-12-21 12:30 ` Russell King 2006-12-21 12:36 ` Russell King 2 siblings, 1 reply; 311+ messages in thread From: Russell King @ 2006-12-21 12:30 UTC (permalink / raw) To: Linus Torvalds Cc: Gordon Farquharson, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann On Wed, Dec 20, 2006 at 11:53:25PM -0800, Linus Torvalds wrote: > That's obviously a bug worth fixing on its own. Do you know when it > started? My last merge, just before 2.6.19-rc1. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 12:30 ` Russell King @ 2006-12-21 12:36 ` Russell King 0 siblings, 0 replies; 311+ messages in thread From: Russell King @ 2006-12-21 12:36 UTC (permalink / raw) To: Linus Torvalds, Gordon Farquharson, Martin Michlmayr, Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann On Thu, Dec 21, 2006 at 12:30:22PM +0000, Russell King wrote: > On Wed, Dec 20, 2006 at 11:53:25PM -0800, Linus Torvalds wrote: > > That's obviously a bug worth fixing on its own. Do you know when it > > started? > > My last merge, just before 2.6.19-rc1. Obviously 2.6.20-rc1. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 19:50 ` Linus Torvalds ` (5 preceding siblings ...) 2006-12-21 7:32 ` Gordon Farquharson @ 2006-12-21 11:21 ` Martin Michlmayr 6 siblings, 0 replies; 311+ messages in thread From: Martin Michlmayr @ 2006-12-21 11:21 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Hugh Dickins, Nick Piggin, Arjan van de Ven, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson * Linus Torvalds <torvalds@osdl.org> [2006-12-20 11:50]: > Martin, Andrei, does this make any difference for your corruption > cases? Works for me. -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 17:03 ` Martin Michlmayr 2006-12-20 17:35 ` Linus Torvalds @ 2006-12-20 22:11 ` Russell King 2006-12-21 8:18 ` Martin Michlmayr 1 sibling, 1 reply; 311+ messages in thread From: Russell King @ 2006-12-20 22:11 UTC (permalink / raw) To: Martin Michlmayr Cc: Peter Zijlstra, Hugh Dickins, Arjan van de Ven, Linus Torvalds, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Wed, Dec 20, 2006 at 06:03:23PM +0100, Martin Michlmayr wrote: > * Peter Zijlstra <a.p.zijlstra@chello.nl> [2006-12-20 14:56]: > > page_mkclean_one() fix > > This patch doesn't fix my problem (apt segfaults on ARM because its > database is corrupted). Are you using IDE in PIO mode? If so, the bug probably lies there. As I've said repeatedly when asked by IDE folk to test their PIO-based cache coherency fixes, I am unable to reproduce the bug, ergo I am unable to test the fix. (Some people, such as Jeff Garzik to name names, took that as me being entirely unreasonable and un-cooperative. But consider carefully - how can _anyone_ test something that they can't produce? I consider Jeff's comments extremely childish in that respect.) Hence, as far as I'm aware, Linux on PIO-based IDE ARM hardware remains utterly *unsafe*. Sorry. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 22:11 ` Russell King @ 2006-12-21 8:18 ` Martin Michlmayr 2006-12-21 9:54 ` Russell King 0 siblings, 1 reply; 311+ messages in thread From: Martin Michlmayr @ 2006-12-21 8:18 UTC (permalink / raw) To: rmk+lkml, Peter Zijlstra, Hugh Dickins, Arjan van de Ven, Linus Torvalds, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson * Russell King <rmk+lkml@arm.linux.org.uk> [2006-12-20 22:11]: > > This patch doesn't fix my problem (apt segfaults on ARM because its > > database is corrupted). > > Are you using IDE in PIO mode? If so, the bug probably lies there. I'm using usb-storage. It's used to access an external IDE drive in an USB enclosure but I don't think it matters that it's IDE since we're using the SCSI layer to talk to it, right? -- Martin Michlmayr http://www.cyrius.com/ ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-21 8:18 ` Martin Michlmayr @ 2006-12-21 9:54 ` Russell King 0 siblings, 0 replies; 311+ messages in thread From: Russell King @ 2006-12-21 9:54 UTC (permalink / raw) To: Martin Michlmayr Cc: Peter Zijlstra, Hugh Dickins, Arjan van de Ven, Linus Torvalds, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Florian Weimer, Marc Haber, Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, gordonfarquharson On Thu, Dec 21, 2006 at 09:18:45AM +0100, Martin Michlmayr wrote: > * Russell King <rmk+lkml@arm.linux.org.uk> [2006-12-20 22:11]: > > > This patch doesn't fix my problem (apt segfaults on ARM because its > > > database is corrupted). > > > > Are you using IDE in PIO mode? If so, the bug probably lies there. > > I'm using usb-storage. It's used to access an external IDE drive in > an USB enclosure but I don't think it matters that it's IDE since > we're using the SCSI layer to talk to it, right? USB generally uses DMA so you're probably safe. -- Russell King Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/ maintainer of: ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) 2006-12-20 11:26 ` [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Peter Zijlstra 2006-12-20 11:39 ` Jesper Juhl 2006-12-20 13:00 ` Hugh Dickins @ 2006-12-20 14:55 ` Martin Schwidefsky 2 siblings, 0 replies; 311+ messages in thread From: Martin Schwidefsky @ 2006-12-20 14:55 UTC (permalink / raw) To: Peter Zijlstra Cc: Arjan van de Ven, Linus Torvalds, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr, Heiko Carstens, Arnd Bergmann On Wed, 2006-12-20 at 12:26 +0100, Peter Zijlstra wrote: > fix page_mkclean_one() > > it had several issues: > - it failed to flush the cache > - it failed to flush the tlb > - it failed to do s390 (s390 guys, please verify this is now correct) Sorry, page_mkclean is broken for s390. But it has already been broken before your change. It is only more broken now. > @@ -440,22 +440,23 @@ static int page_mkclean_one(struct page > if (address == -EFAULT) > goto out; > > - pte = page_check_address(page, mm, address, &ptl); > - if (!pte) > + ptep = page_check_address(page, mm, address, &ptl); > + if (!ptep) > goto out; > > - if (!pte_dirty(*pte) && !pte_write(*pte)) > - goto unlock; > - > - entry = ptep_get_and_clear(mm, address, pte); > - entry = pte_mkclean(entry); > - entry = pte_wrprotect(entry); > - ptep_establish(vma, address, pte, entry); > - lazy_mmu_prot_update(entry); > - ret = 1; > + while (pte_dirty(*ptep) || pte_write(*ptep)) { > + pte_t entry = ptep_get_and_clear(mm, address, ptep); > + flush_cache_page(vma, address, pte_pfn(entry)); > + flush_tlb_page(vma, address); > + (void)page_test_and_clear_dirty(page); /* do the s390 thing */ > + entry = pte_wrprotect(entry); > + entry = pte_mkclean(entry); > + set_pte_at(vma, address, ptep, entry); > + lazy_mmu_prot_update(entry); > + ret = 1; > + } > > -unlock: > - pte_unmap_unlock(pte, ptl); > + 
pte_unmap_unlock(ptep, ptl); > out: > return ret; > } 1) pte_dirty() is always false. The reason is that s390 keeps the dirty bit information in the storage key and not the pte. If pte_write is false as well, nothing is done. There really should be an if (page_test_and_clear_dirty(page)) ret = 1; at the end of page_mkclean. 2) Please use ptep_clear_flush instead of ptep_get_and_clear + flush_tlb_page. The former uses an optimization on s390 that flushes just one TLB, the latter flushes every TLB of the current mm. My attempt to fix this up is attached. It moves the flush_cache_page after the flush_tlb_page (see asm-generic/pgtable.h for the generic definition of ptep_clear_flush that is used for i386). I hope this doesn't break anything else. -- blue skies, Martin. Martin Schwidefsky Linux for zSeries Development & Services IBM Deutschland Entwicklung GmbH "Reality continues to ruin my life." - Calvin. --- mm/rmap.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff -urpN linux-2.6/mm/rmap.c linux-2.6-mkclean/mm/rmap.c --- linux-2.6/mm/rmap.c 2006-12-20 15:49:01.000000000 +0100 +++ linux-2.6-mkclean/mm/rmap.c 2006-12-20 15:51:14.000000000 +0100 @@ -445,10 +445,8 @@ static int page_mkclean_one(struct page goto out; while (pte_dirty(*ptep) || pte_write(*ptep)) { - pte_t entry = ptep_get_and_clear(mm, address, ptep); + pte_t entry = ptep_clear_flush(vma, address, ptep); flush_cache_page(vma, address, pte_pfn(entry)); - flush_tlb_page(vma, address); - (void)page_test_and_clear_dirty(page); /* do the s390 thing */ entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(vma, address, ptep, entry); @@ -490,6 +488,8 @@ int page_mkclean(struct page *page) if (mapping) ret = page_mkclean_file(mapping, page); } + if (page_test_and_clear_dirty(page)) + ret = 1; return ret; } ^ permalink raw reply [flat|nested] 311+ messages in thread
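Editorial sketch, not part of the thread: the review above argues that a per-pte walk alone can miss dirty state on an architecture that keeps the dirty bit per physical frame rather than in the pte. The user-space C model below illustrates that argument only. Every name in it (toy_page, toy_mkclean_one, toy_page_mkclean, the TOY_PTE_* flags) is hypothetical, and cache and TLB flushing are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pte flag bits, for illustration only. */
#define TOY_PTE_WRITE 0x1u
#define TOY_PTE_DIRTY 0x2u

struct toy_page {
    bool key_dirty;       /* models a per-frame dirty bit (s390 storage key) */
    unsigned int *ptes;   /* toy ptes that map this page */
    size_t nr_ptes;
};

/* Per-pte step: write-protect and clean one pte; return 1 if it changed. */
int toy_mkclean_one(unsigned int *ptep)
{
    assert(ptep != NULL);
    if (!(*ptep & (TOY_PTE_WRITE | TOY_PTE_DIRTY)))
        return 0;                       /* already clean and read-only */
    *ptep &= ~(TOY_PTE_WRITE | TOY_PTE_DIRTY);
    return 1;
}

/* Walk all ptes, then consult the per-frame dirty state as well. */
int toy_page_mkclean(struct toy_page *page)
{
    int ret = 0;

    assert(page != NULL);
    for (size_t i = 0; i < page->nr_ptes; i++)
        if (toy_mkclean_one(&page->ptes[i]))
            ret = 1;

    /* Without this check, a frame whose ptes are all read-only but whose
     * per-frame dirty bit is set would be reported as clean. */
    if (page->key_dirty) {
        page->key_dirty = false;
        ret = 1;
    }
    return ret;
}
```

In this model the final check plays the role of the "if (page_test_and_clear_dirty(page)) ret = 1;" hunk in the patch above; on an architecture with pte-based dirty bits, key_dirty would simply never be set.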
* Re: 2.6.19 file content corruption on ext3 2006-12-20 9:01 ` Peter Zijlstra 2006-12-20 9:12 ` Peter Zijlstra 2006-12-20 9:39 ` Arjan van de Ven @ 2006-12-20 14:27 ` Martin Schwidefsky 2 siblings, 0 replies; 311+ messages in thread From: Martin Schwidefsky @ 2006-12-20 14:27 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr, Heiko Carstens On Wed, 2006-12-20 at 10:01 +0100, Peter Zijlstra wrote: > Also, what is this page_test_and_clear_dirty() business, that seems to > be exclusively s390 btw. However they do seem to need this. > > > But the "ptep_get_and_clear() + flush_tlb_page()" sequence should > > hopefully also work. > > Yeah, probably, not optimally so on some archs that don't actually need > the flush though. And as above, I wonder about s390. Simple, the s390 architecture does not keep the dirty bit in the pte but in something called the storage key. For each physical page there is one associated storage key. It is accessed with special instructions like "iske", "sske" or "rrbe". To clear the dirty bit the storage key of a page is read with iske, the bit is cleared and the storage key is stored back with sske. That means that clearing the dirty bit is not an atomic operation. rrbe is used to test and clear the referenced bit (young/old information) and is atomic in regard to other storage key operations. If you think about it, the storage keys are quite nice for the operating system, page_referenced() can be implemented with a single test "page_test_and_clear_young()". No need to read all the ptes pointing to the page. The downside is that the storage keys have a cost on the hardware side. -- blue skies, Martin. Martin Schwidefsky Linux for zSeries Development & Services IBM Deutschland Entwicklung GmbH "Reality continues to ruin my life." - Calvin. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-20 0:23 ` Linus Torvalds 2006-12-20 9:01 ` Peter Zijlstra @ 2006-12-20 9:32 ` Peter Zijlstra 1 sibling, 0 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-20 9:32 UTC (permalink / raw) To: Linus Torvalds Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 2006-12-19 at 16:23 -0800, Linus Torvalds wrote: > Pls test. Is good. Only s390 remains a question. Another point, change_protection() also does a cache flush, should we too? > ---- > diff --git a/mm/rmap.c b/mm/rmap.c > index d8a842a..eec8706 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -448,9 +448,10 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) > goto unlock; > > entry = ptep_get_and_clear(mm, address, pte); flush_cache_page(vma, address, pte_pfn(entry)); > + flush_tlb_page(vma, address); > entry = pte_mkclean(entry); > entry = pte_wrprotect(entry); > - ptep_establish(vma, address, pte, entry); > + set_pte_at(mm, address, pte, entry); > lazy_mmu_prot_update(entry); > ret = 1; > > ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 23:42 ` Peter Zijlstra 2006-12-20 0:23 ` Linus Torvalds @ 2006-12-20 14:15 ` Andrei Popa 2006-12-20 14:23 ` Peter Zijlstra 1 sibling, 1 reply; 311+ messages in thread From: Andrei Popa @ 2006-12-20 14:15 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote: > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote: > > > OR: > > > > - page_mkclean_one() is simply buggy. > > GOLD! > > it seems to work with all this (full diff against current git). > > /me rebuilds full kernel to make sure... > reboot... > test... pff the tension... > yay, still good! > > Andrei; would you please verify. I have corrupted files. > The magic seems to be in the extra tlb flush after clearing the dirty > bit. Just too bad ptep_clear_flush_dirty() needs ptep not entry. > > diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c > index 5e7cd45..2b8893b 100644 > --- a/drivers/connector/connector.c > +++ b/drivers/connector/connector.c > @@ -135,8 +135,7 @@ static int cn_call_callback(struct cn_msg *msg, void (*destruct_data)(void *), v > spin_lock_bh(&dev->cbdev->queue_lock); > list_for_each_entry(__cbq, &dev->cbdev->queue_list, callback_entry) { > if (cn_cb_equal(&__cbq->id.id, &msg->id)) { > - if (likely(!test_bit(WORK_STRUCT_PENDING, > - &__cbq->work.work.management) && > + if (likely(!delayed_work_pending(&__cbq->work) && > __cbq->data.ddata == NULL)) { > __cbq->data.callback_priv = msg; > > diff --git a/fs/buffer.c b/fs/buffer.c > index d1f1b54..263f88e 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page) > int ret = 0; > > BUG_ON(!PageLocked(page)); > - if (PageWriteback(page)) > + if (PageDirty(page) || PageWriteback(page)) > return 0; > > if (mapping == NULL) { /* can this still happen? 
*/ > @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page) > spin_lock(&mapping->private_lock); > ret = drop_buffers(page, &buffers_to_free); > spin_unlock(&mapping->private_lock); > - if (ret) { > - /* > - * If the filesystem writes its buffers by hand (eg ext3) > - * then we can have clean buffers against a dirty page. We > - * clean the page here; otherwise later reattachment of buffers > - * could encounter a non-uptodate page, which is unresolvable. > - * This only applies in the rare case where try_to_free_buffers > - * succeeds but the page is not freed. > - * > - * Also, during truncate, discard_buffer will have marked all > - * the page's buffers clean. We discover that here and clean > - * the page also. > - */ > - if (test_clear_page_dirty(page)) > - task_io_account_cancelled_write(PAGE_CACHE_SIZE); > - } > out: > if (buffers_to_free) { > struct buffer_head *bh = buffers_to_free; > diff --git a/mm/memory.c b/mm/memory.c > index c00bac6..60e0945 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping, > } > EXPORT_SYMBOL(unmap_mapping_range); > > +static void check_last_page(struct address_space *mapping, loff_t size) > +{ > + pgoff_t index; > + unsigned int offset; > + struct page *page; > + > + if (!mapping) > + return; > + offset = size & ~PAGE_MASK; > + if (!offset) > + return; > + index = size >> PAGE_SHIFT; > + page = find_lock_page(mapping, index); > + if (page) { > + unsigned int check = 0; > + unsigned char *kaddr = kmap_atomic(page, KM_USER0); > + do { > + check += kaddr[offset++]; > + } while (offset < PAGE_SIZE); > + kunmap_atomic(kaddr, KM_USER0); > + unlock_page(page); > + page_cache_release(page); > + if (check) > + printk(KERN_ERR "%s: BADNESS: truncate check %u\n", current->comm, check); > + } > +} > + > /** > * vmtruncate - unmap mappings "freed" by truncate() syscall > * @inode: inode of the file used > @@ -1875,6 +1902,7 @@ do_expand: > goto out_sig; > 
if (offset > inode->i_sb->s_maxbytes) > goto out_big; > + check_last_page(mapping, inode->i_size); > i_size_write(inode, offset); > > out_truncate: > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 237107c..f561e72 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -957,7 +957,7 @@ int test_set_page_writeback(struct page *page) > EXPORT_SYMBOL(test_set_page_writeback); > > /* > - * Return true if any of the pages in the mapping are marged with the > + * Return true if any of the pages in the mapping are marked with the > * passed tag. > */ > int mapping_tagged(struct address_space *mapping, int tag) > diff --git a/mm/rmap.c b/mm/rmap.c > index d8a842a..900229a 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) > { > struct mm_struct *mm = vma->vm_mm; > unsigned long address; > - pte_t *pte, entry; > + pte_t *ptep, entry; > spinlock_t *ptl; > int ret = 0; > > @@ -440,22 +440,23 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) > if (address == -EFAULT) > goto out; > > - pte = page_check_address(page, mm, address, &ptl); > - if (!pte) > + ptep = page_check_address(page, mm, address, &ptl); > + if (!ptep) > goto out; > > - if (!pte_dirty(*pte) && !pte_write(*pte)) > + if (!pte_dirty(*ptep) && !pte_write(*ptep)) > goto unlock; > > - entry = ptep_get_and_clear(mm, address, pte); > - entry = pte_mkclean(entry); > + entry = ptep_get_and_clear(mm, address, ptep); > entry = pte_wrprotect(entry); > - ptep_establish(vma, address, pte, entry); > + ptep_establish(vma, address, ptep, entry); > + ret = ptep_clear_flush_dirty(vma, address, ptep) || > + page_test_and_clear_dirty(page); > lazy_mmu_prot_update(entry); > ret = 1; > > unlock: > - pte_unmap_unlock(pte, ptl); > + pte_unmap_unlock(ptep, ptl); > out: > return ret; > } > > ^ permalink raw reply [flat|nested] 311+ messages in thread
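Editorial sketch, not part of the thread: the check_last_page() debugging helper in the quoted patch sums the bytes of the last page that lie beyond the end of the file and complains if the sum is nonzero. The same arithmetic can be tried in user space as below. TOY_PAGE_SIZE and toy_tail_checksum are hypothetical names, a 4096-byte page is an assumption, and the page-cache lookup, locking and kmap of the original are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for the sketch. */
#define TOY_PAGE_SIZE 4096u

/* Sum the bytes of `page` from the in-page offset of `size` to the end of
 * the page. A file whose size is page-aligned has no tail, so 0 is
 * returned. A nonzero result means stale data sits past end-of-file. */
unsigned int toy_tail_checksum(const unsigned char *page,
                               unsigned long long size)
{
    size_t offset;
    unsigned int check = 0;

    assert(page != NULL);
    offset = (size_t)(size % TOY_PAGE_SIZE);
    if (offset == 0)
        return 0;
    for (; offset < TOY_PAGE_SIZE; offset++)
        check += page[offset];
    return check;
}
```

Like the original, this is a plain byte sum, so it only signals that something nonzero is present; it does not say where.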
* Re: 2.6.19 file content corruption on ext3 2006-12-20 14:15 ` Andrei Popa @ 2006-12-20 14:23 ` Peter Zijlstra 2006-12-20 16:30 ` Andrei Popa 0 siblings, 1 reply; 311+ messages in thread From: Peter Zijlstra @ 2006-12-20 14:23 UTC (permalink / raw) To: andrei.popa Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote: > On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote: > > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote: > > > > > OR: > > > > > > - page_mkclean_one() is simply buggy. > > > > GOLD! > > > > it seems to work with all this (full diff against current git). > > > > /me rebuilds full kernel to make sure... > > reboot... > > test... pff the tension... > > yay, still good! > > > > Andrei; would you please verify. > > I have corrupted files. drad; and with this patch: http://lkml.org/lkml/2006/12/20/112 /me goes rebuild his kernel and try more than 3 times ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-20 14:23 ` Peter Zijlstra @ 2006-12-20 16:30 ` Andrei Popa 2006-12-20 16:36 ` Peter Zijlstra 0 siblings, 1 reply; 311+ messages in thread From: Andrei Popa @ 2006-12-20 16:30 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Wed, 2006-12-20 at 15:23 +0100, Peter Zijlstra wrote: > On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote: > > On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote: > > > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote: > > > > > > > OR: > > > > > > > > - page_mkclean_one() is simply buggy. > > > > > > GOLD! > > > > > > it seems to work with all this (full diff against current git). > > > > > > /me rebuilds full kernel to make sure... > > > reboot... > > > test... pff the tension... > > > yay, still good! > > > > > > Andrei; would you please verify. > > > > I have corrupted files. > > drad; and with this patch: > http://lkml.org/lkml/2006/12/20/112 Hash check on download completion found bad chunks, consider using "safe_sync". > > /me goes rebuild his kernel and try more than 3 times > ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-20 16:30 ` Andrei Popa @ 2006-12-20 16:36 ` Peter Zijlstra 0 siblings, 0 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-20 16:36 UTC (permalink / raw) To: andrei.popa Cc: Linus Torvalds, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Wed, 2006-12-20 at 18:30 +0200, Andrei Popa wrote: > On Wed, 2006-12-20 at 15:23 +0100, Peter Zijlstra wrote: > > On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote: > > > On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote: > > > > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote: > > > > > > > > > OR: > > > > > > > > > > - page_mkclean_one() is simply buggy. > > > > > > > > GOLD! > > > > > > > > it seems to work with all this (full diff against current git). > > > > > > > > /me rebuilds full kernel to make sure... > > > > reboot... > > > > test... pff the tension... > > > > yay, still good! > > > > > > > > Andrei; would you please verify. > > > > > > I have corrupted files. > > > > drad; and with this patch: > > http://lkml.org/lkml/2006/12/20/112 > > Hash check on download completion found bad chunks, consider using > "safe_sync". *sigh* back to square 1. and I need to look at my reproduction case ;-( Thanks for testing. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 19:18 ` Linus Torvalds 2006-12-18 19:44 ` Andrei Popa @ 2006-12-19 7:38 ` Peter Zijlstra 1 sibling, 0 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-19 7:38 UTC (permalink / raw) To: Linus Torvalds Cc: Andrei Popa, Andrew Morton, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 2006-12-18 at 11:18 -0800, Linus Torvalds wrote: > > diff --git a/mm/rmap.c b/mm/rmap.c > > index d8a842a..3f9061e 100644 > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page > > goto unlock; > > > > entry = ptep_get_and_clear(mm, address, pte); > > - entry = pte_mkclean(entry); > > + /*entry = pte_mkclean(entry);*/ > > entry = pte_wrprotect(entry); > > ptep_establish(vma, address, pte, entry); > > lazy_mmu_prot_update(entry); > > The above patch is bad. It's always going to hide the bug, but it hides it > by just not doing anything at all. Not quite, it does wrprotect still, so further updates will trigger the do_wp_page() path and call set_page_dirty(). So we could make 'something' that would keep the tracking working and not create corruption, say something like this: However I'll try and figure out how we get so terribly confused on the PG_dirty state that we have to clean it and fall back to pte_dirty. That is the real issue we have. 
--- include/linux/rmap.h | 6 ++++++ mm/page-writeback.c | 3 ++- mm/rmap.c | 23 ++++++++++++++++++----- 3 files changed, 26 insertions(+), 6 deletions(-) Index: linux-2.6-git/mm/rmap.c =================================================================== --- linux-2.6-git.orig/mm/rmap.c 2006-12-18 11:06:29.000000000 +0100 +++ linux-2.6-git/mm/rmap.c 2006-12-19 08:33:57.000000000 +0100 @@ -428,7 +428,8 @@ int page_referenced(struct page *page, i return referenced; } -static int page_mkclean_one(struct page *page, struct vm_area_struct *vma) +static int page_mkcw_one(struct page *page, + struct vm_area_struct *vma, int make_clean) { struct mm_struct *mm = vma->vm_mm; unsigned long address; @@ -448,7 +449,8 @@ static int page_mkclean_one(struct page goto unlock; entry = ptep_get_and_clear(mm, address, pte); - entry = pte_mkclean(entry); + if (make_clean) + entry = pte_mkclean(entry); entry = pte_wrprotect(entry); ptep_establish(vma, address, pte, entry); lazy_mmu_prot_update(entry); @@ -460,7 +462,8 @@ out: return ret; } -static int page_mkclean_file(struct address_space *mapping, struct page *page) +static int page_mkcw_file(struct address_space *mapping, + struct page *page, int make_clean) { pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT); struct vm_area_struct *vma; @@ -478,7 +481,7 @@ static int page_mkclean_file(struct addr return ret; } -int page_mkclean(struct page *page) +static int page_mkcw(struct page *page, int make_clean) { int ret = 0; @@ -487,12 +490,22 @@ int page_mkclean(struct page *page) if (page_mapped(page)) { struct address_space *mapping = page_mapping(page); if (mapping) - ret = page_mkclean_file(mapping, page); + ret = page_mkcw_file(mapping, page, make_clean); } return ret; } +int page_mkclean(struct page *page) +{ + return page_mkcw(page, 1); +} + +int page_wrprotect(struct page *page) +{ + return page_mkcw(page, 0); +} + /** * page_set_anon_rmap - setup new anonymous rmap * @page: the page to add the mapping to Index: 
linux-2.6-git/include/linux/rmap.h =================================================================== --- linux-2.6-git.orig/include/linux/rmap.h 2006-12-19 08:31:59.000000000 +0100 +++ linux-2.6-git/include/linux/rmap.h 2006-12-19 08:32:28.000000000 +0100 @@ -110,6 +110,7 @@ unsigned long page_address_in_vma(struct * returns the number of cleaned PTEs. */ int page_mkclean(struct page *); +int page_wrprotect(struct page *); #else /* !CONFIG_MMU */ @@ -125,6 +126,11 @@ static inline int page_mkclean(struct pa return 0; } +static inline int page_wrprotect(struct page *page) +{ + return 0; +} + #endif /* CONFIG_MMU */ Index: linux-2.6-git/mm/page-writeback.c =================================================================== --- linux-2.6-git.orig/mm/page-writeback.c 2006-12-19 08:24:48.000000000 +0100 +++ linux-2.6-git/mm/page-writeback.c 2006-12-19 08:31:43.000000000 +0100 @@ -872,7 +872,8 @@ int test_clear_page_dirty(struct page *p * page is locked, which pins the address_space */ if (mapping_cap_account_dirty(mapping)) { - page_mkclean(page); + if (page_wrprotect(page)) + set_page_dirty(page); dec_zone_page_state(page, NR_FILE_DIRTY); } return 1; ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-18 18:03 ` Linus Torvalds 2006-12-18 18:24 ` Peter Zijlstra @ 2006-12-19 4:36 ` Nick Piggin 2006-12-19 6:34 ` Linus Torvalds 2006-12-19 7:22 ` Peter Zijlstra 1 sibling, 2 replies; 311+ messages in thread From: Nick Piggin @ 2006-12-19 4:36 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr [-- Attachment #1: Type: text/plain, Size: 3403 bytes --] Linus Torvalds wrote: > On Mon, 18 Dec 2006, Peter Zijlstra wrote: > >>This should be safe; page_mkclean walks the rmap and flips the pte's >>under the pte lock and records the dirty state while iterating. >>Concurrent faults will either do set_page_dirty() before we get around >>to doing it or vice versa, but dirty state is not lost. > > > Ok, I really liked this patch, but the more I thought about it, the more I > started to doubt the reasons for liking it. Well this implements my suggestion to redirty the page if there were dirty ptes. I think it is a good fix (whether or not it fixes Andrei's bug, it does fix a bug), though maybe _slightly_ suboptimal. > I think we have some core fundamental problem here that this patch is > needed at all. > > So let's think about this: we apparently have two cases of > "clear_page_dirty()": > > - the one that really wants to clear the bit unconditionally (Andrew > calls this the "must_clean_ptes" case, which I personally find to be a > really confusing name, but whatever) > > - the other case. The case that doesn't want to really clear the pte > dirty bits. I don't think this characterises it correctly. Think about how it worked before the page_mkclean went in there. We really _never_ want to just clear pte dirty bits, because that would be a data loss situation[*]. The only reason we clear PG_dirty is because some filesystem may have cleaned each buffer without realising it has cleaned the whole page. 
But if you have a dirty pte, then all bets are off: a buffer with a clear dirty bit can not be considered clean. Before the dirty page tracking, it was fine to clear PG_dirty here, because we would pick up the pte dirty info later on. After the page dirty tracking, clearing pte dirty is a bug here, and re-accounting the dirty page is arguably the minimal fix. [*] except in the truncate case where we are happy to throw out dirty data, but in that case there would be no ptes anyway. The only thing I would suggest is not applying Andrew's patch at all, and do the special casing in try_to_free_buffers(). I've attached a patch for comments. > and I thought your patch made sense, because it saved away the pte state > in the page dirty state, and that matches my mental model, but the more I > think about it, the less sense that whole "the other case" situation makes > AT ALL. > > Why does "the other case" exist at all? If you want to clear the dirty > page flag, what is _ever_ the reason for not wanting to drop PTE dirty > information? In other words, what possible reason can there ever be for > saying "I want this page to be clean", while at the same time saying "but > if it was dirty in the page tables, don't forget about that state". We never want to drop dirty data! (ignoring the truncate case, which is handled privately by truncate anyway) This whole exercise is not about cleaning or dirtying or fogetting the actual *data* in the page. It is about bringing the pagecache's notion of whether the page is dirty or clean in line with the (more uptodate) filesystem's notion. After dirty write accounting, we also threw in "the virtual memory manager's notion", but got that case slightly wrong. As unlikely as this race is for SMP systems, I think it is easily possible for PREEMPT kernels. And they have featured in all bug reports, AFAIKS. -- SUSE Labs, Novell Inc. 
[-- Attachment #2: fs-fix.patch --]
[-- Type: text/plain, Size: 3904 bytes --]

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2006-12-19 15:15:46.000000000 +1100
+++ linux-2.6/fs/buffer.c	2006-12-19 15:36:01.000000000 +1100
@@ -2852,7 +2852,17 @@ int try_to_free_buffers(struct page *pag
 		 * This only applies in the rare case where try_to_free_buffers
 		 * succeeds but the page is not freed.
 		 */
-		clear_page_dirty(page);
+
+		/*
+		 * If the page has been dirtied via the user mappings, then
+		 * clean buffers does not indicate the page data is actually
+		 * clean! Only clear the page dirty bit if there are no dirty
+		 * ptes either.
+		 *
+		 * If there are dirty ptes, then the page must be uptodate, so
+		 * the above concern does not apply.
+		 */
+		clear_page_dirty_sync_ptes(page);
 	}
 out:
 	if (buffers_to_free) {
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2006-12-19 15:17:18.000000000 +1100
+++ linux-2.6/include/linux/page-flags.h	2006-12-19 15:34:24.000000000 +1100
@@ -254,6 +254,7 @@ static inline void SetPageUptodate(struc
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty_sync_ptes(struct page *page);
 int test_clear_page_writeback(struct page *page);
 int test_set_page_writeback(struct page *page);
 
@@ -262,6 +263,11 @@ static inline void clear_page_dirty(stru
 	test_clear_page_dirty(page);
 }
 
+static inline void clear_page_dirty_sync_ptes(struct page *page)
+{
+	test_clear_page_dirty_sync_ptes(page);
+}
+
 static inline void set_page_writeback(struct page *page)
 {
 	test_set_page_writeback(page);
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2006-12-19 15:17:53.000000000 +1100
+++ linux-2.6/mm/page-writeback.c	2006-12-19 15:33:29.000000000 +1100
@@ -844,9 +844,10 @@ EXPORT_SYMBOL(set_page_dirty_lock);
 
 /*
  * Clear a page's dirty flag, while caring for dirty memory accounting.
+ * Does not clear pte dirty bits.
  * Returns true if the page was previously dirty.
  */
-int test_clear_page_dirty(struct page *page)
+static int test_clear_page_dirty_leave_ptes(struct page *page)
 {
 	struct address_space *mapping = page_mapping(page);
 	unsigned long flags;
@@ -862,10 +863,8 @@ int test_clear_page_dirty(struct page *p
 			 * We can continue to use `mapping' here because the
 			 * page is locked, which pins the address_space
 			 */
-			if (mapping_cap_account_dirty(mapping)) {
-				page_mkclean(page);
+			if (mapping_cap_account_dirty(mapping))
 				dec_zone_page_state(page, NR_FILE_DIRTY);
-			}
 			return 1;
 		}
 		write_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -873,9 +872,43 @@ int test_clear_page_dirty(struct page *p
 	}
 	return TestClearPageDirty(page);
 }
+
+/*
+ * As above, but does clear dirty bits from ptes
+ */
+int test_clear_page_dirty(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (test_clear_page_dirty_leave_ptes(page)) {
+		if (mapping_cap_account_dirty(mapping))
+			page_mkclean(page);
+		return 1;
+	}
+	return 0;
+}
 EXPORT_SYMBOL(test_clear_page_dirty);
 
 /*
+ * As above, but redirties page if any dirty ptes are found (and then only
+ * if the mapping accounts dirty pages, otherwise dirty ptes are left dirty
+ * but the page is cleaned).
+ */
+int test_clear_page_dirty_sync_ptes(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (test_clear_page_dirty_leave_ptes(page)) {
+		if (mapping_cap_account_dirty(mapping)) {
+			if (page_mkclean(page))
+				set_page_dirty(page);
+		}
+		return 1;
+	}
+	return 0;
+}
+
+/*
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  *

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 4:36 ` Nick Piggin @ 2006-12-19 6:34 ` Linus Torvalds 2006-12-19 6:51 ` Nick Piggin 2006-12-19 20:03 ` dean gaudet 2006-12-19 7:22 ` Peter Zijlstra 1 sibling, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-19 6:34 UTC (permalink / raw) To: Nick Piggin Cc: Peter Zijlstra, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 19 Dec 2006, Nick Piggin wrote: > > We never want to drop dirty data! (ignoring the truncate case, which is > handled privately by truncate anyway) Bzzt. SURE we do. We absolutely do want to drop dirty data in the writeout path. How do you think dirty data ever _becomes_ clean data? In other words, yes, we _do_ want to test-and-clear all the pgtable bits _and_ the PG_dirty bit. We want to do it for: - writeout - truncate - possibly a "drop" event (which could be a case for a journal entry that becomes stale due to being replaced or something - kind of "truncate" on metadata) because both of those events _literally_ turn dirty state into clean state. In no other circumstance do we ever want to clear a dirty bit, as far as I can tell. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3
  2006-12-19 6:34 ` Linus Torvalds
@ 2006-12-19 6:51 ` Nick Piggin
  2006-12-19 7:26 ` Linus Torvalds
  2006-12-19 20:03 ` dean gaudet
  1 sibling, 1 reply; 311+ messages in thread
From: Nick Piggin @ 2006-12-19 6:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

Linus Torvalds wrote:
>
> On Tue, 19 Dec 2006, Nick Piggin wrote:
>
>>We never want to drop dirty data! (ignoring the truncate case, which is
>>handled privately by truncate anyway)
>
>
> Bzzt.
>
> SURE we do.
>
> We absolutely do want to drop dirty data in the writeout path.
>
> How do you think dirty data ever _becomes_ clean data?

I wouldn't have thought it becomes clean by dropping it ;) Is this a
trick question? My answer is that we clean a page by taking some
action such that the underlying data matches the data in RAM...

We don't "drop" any data until it has been cleaned (again, ignoring
things like truncate for a minute). That's a bug!

And try_to_free_buffers() is called from places outside the writeout
path. This is our bug (or at least, one of our bugs that appears to
have the same triggers and symptoms as people are reporting).

[...]

> In no other circumstance do we ever want to clear a dirty bit, as far as I
> can tell.

Exactly. And that is exactly what try_to_free_buffers is doing now.

I still think you should have a look at the patch.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3
  2006-12-19 6:51 ` Nick Piggin
@ 2006-12-19 7:26 ` Linus Torvalds
  2006-12-19 8:04 ` Linus Torvalds
  0 siblings, 1 reply; 311+ messages in thread
From: Linus Torvalds @ 2006-12-19 7:26 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Tue, 19 Dec 2006, Nick Piggin wrote:
>
> I wouldn't have thought it becomes clean by dropping it ;) Is this a
> trick question? My answer is that we clean a page by taking some
> action such that the underlying data matches the data in RAM...

Sure.

> We don't "drop" any data until it has been cleaned (again, ignoring
> things like truncate for a minute). That's a bug!

Actually, it's the other way around. We have to drop the dirty bits BEFORE
cleaning. If we clean first, and _then_ drop the dirty bits, THAT is a
bug, because the dirty bits can now refer to _new_ dirty data that didn't
get written out.

So the proper sequence is _literally_ to mark the page clean FIRST. Drop
all the dirty bits, but not the _data_ obviously (ie you have a reference
to the page). And _then_ you do the writeout to actually clean the data
itself.

So you actually state it exactly the wrong way around. We MUST clear the
dirty bits before we do the IO that actually cleans the data. Exactly
because if new writes keep on happening, if we do it in the other order,
we'll drop dirty data on the floor.

> > In no other circumstance do we ever want to clear a dirty bit, as far as I
> > can tell.
>
> Exactly. And that is exactly what try_to_free_buffers is doing now.
>
> I still think you should have a look at the patch.

I claim that dropping dirty bits AFTER the IO is always wrong.
Try_to_free_buffers() must never touch the dirty bits at all, because by
definition that thing happens after the IO has actually been done.

And yes, I looked at your patch. And it looks a million times cleaner than
Andrew's patch. However, it's already been tested multiple times, and
totally REMOVING the "clear_page_dirty()" from try_to_free_buffers() still
resulted in the corruption.

That said, I think your patch is worth it just as a cleanup. Much nicer
than Andrew's code, also from a naming standpoint. So I'm not actually
disagreeing about the patch itself, but I _am_ saying that I don't
actually see the point of ever moving the dirty bits around.

So I repeat: we have the case where we really want to _remove_ the dirty
bits (because we're going to write the current state of the page to disk,
and we need to clear the dirty bits BEFORE we do that). That's the one
that makes sense, and that's the code we want to run before doing IO. It's
the "clear_dirty_bits_for_io()" case.

The code that doesn't make sense is the "shuffle the dirty bits around" In
other words: when does it actually make sense to call your
(well-implemented, don't get me wrong) "test_clear_page_dirty_sync_ptes()"
function? It doesn't _fix_ anything. It just shuffles dirty bits from one
place to another. What was the point again?

If the point is "try_to_free_buffers()", then my argument was that I had a
much simpler solution: "Just don't do that then". My simple patch sadly
didn't fix the data corruption, so the data corruption comes from
something ELSE than try_to_free_buffers().

		Linus

^ permalink raw reply [flat|nested] 311+ messages in thread
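The ordering argument in the message above (clear the dirty state first, then start the IO) can be illustrated with a small user-space toy. This is only a sketch under stated assumptions: `toy_page`, `toy_store()` and the other `toy_*` names are hypothetical, a single flag stands in for both PG_dirty and the pte dirty bits, and a racing store is simulated by calling it between the IO snapshot and the IO completion. None of this is kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model, not kernel code. */
struct toy_page {
	int data;	/* contents in RAM */
	int disk;	/* contents on the backing store */
	bool dirty;	/* stands in for PG_dirty and the pte dirty bits */
};

/* A store through a mapping: changes the data and marks it dirty. */
static void toy_store(struct toy_page *p, int value)
{
	p->data = value;
	p->dirty = true;
}

/* Nothing is lost as long as the page is dirty or disk matches RAM. */
static bool toy_consistent(const struct toy_page *p)
{
	return p->dirty || p->disk == p->data;
}

/* Clear dirty first, then do the IO; a store races with the IO. */
static bool toy_clear_then_write(struct toy_page *p, int racing_value)
{
	int snapshot;

	p->dirty = false;		/* cf. clear_page_dirty_for_io() */
	snapshot = p->data;		/* IO picks up the current contents */
	toy_store(p, racing_value);	/* racing store re-dirties the page */
	p->disk = snapshot;		/* IO completes with the old contents */
	return toy_consistent(p);
}

/* Do the IO first and clear dirty afterwards; same racing store. */
static bool toy_write_then_clear(struct toy_page *p, int racing_value)
{
	int snapshot = p->data;

	toy_store(p, racing_value);
	p->disk = snapshot;
	p->dirty = false;		/* too late: hides the racing store */
	return toy_consistent(p);
}

/* Runs both orderings on identical starting states. */
static void toy_demo(void)
{
	struct toy_page a = { .data = 1, .disk = 0, .dirty = true };
	struct toy_page b = a;

	assert(toy_clear_then_write(&a, 2));	/* still dirty, will be rewritten */
	assert(!toy_write_then_clear(&b, 2));	/* clean, but disk holds stale data */
}
```

Calling `toy_demo()` from a `main()` exercises both orderings. In the first, the racing store leaves the page dirty, so a later writeout would still pick it up; in the second, the page ends up marked clean while the disk copy is stale, which is the lost-write case the message describes.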
* Re: 2.6.19 file content corruption on ext3
  2006-12-19 7:26 ` Linus Torvalds
@ 2006-12-19 8:04 ` Linus Torvalds
  2006-12-19 9:00 ` Peter Zijlstra
  [not found] ` <4587B762.2030603@yahoo.com.au>
  0 siblings, 2 replies; 311+ messages in thread
From: Linus Torvalds @ 2006-12-19 8:04 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Peter Zijlstra, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Mon, 18 Dec 2006, Linus Torvalds wrote:
>
> The code that doesn't make sense is the "shuffle the dirty bits around" In
> other words: when does it actually make sense to call your
> (well-implemented, don't get me wrong) "test_clear_page_dirty_sync_ptes()"
> function? It doesn't _fix_ anything. It just shuffles dirty bits from one
> place to another. What was the point again?

Let me try to phrase that another way, in terms that you defined.

In other words, look at your test_clear_page_dirty_sync_ptes() function.
First, start out from the _inner_ part, the:

	if (mapping_cap_account_dirty(mapping)) {
		if (page_mkclean(page))
			set_page_dirty(page);
	}

part. This is the one that both you and I agree is a "working" situation:
we are moving the dirty bits from the pte into the "struct page", and we
both agree that this is fine. No dirty bits get lost. You even make a BIG
DEAL about the fact that no dirty bits get lost.

So begin by just explaining:

 - why do it? Why shuffle the dirty bits around? Why not just _leave_ the
   PG_dirty bit on the "struct page", and simply leave it all at that? I
   agree that the above doesn't lose any dirty bits, but what I'm asking
   for is WHAT IS THE POINT?

So that is the code that we both agree "works", but I personally don't see
the _point_ of. However, that's actually not even important, because I
don't even care about the point. I wanted to bring that up just in order
to then ignore it, and look at the stuff _around_ it, namely the other
part in "test_clear_page_dirty_sync_ptes()":

	int test_clear_page_dirty_sync_ptes(struct page *page)
	{
		if (test_clear_page_dirty_leave_ptes(page)) {
			.. do the inner part ..
			return 1;
		}
		return 0;
	}

Now, the above is the OUTER part. Please realize that this DOES actually
drop the PG_dirty bit. So ignore the inner part entirely (which is a no-op
for the case where the page isn't mapped), and explain to me why it's ok
to DROP the dirty bit in the outer part, when you tried to say that it was
NOT ok to drop it in the inner part?

NOTICE? First you make a BIG DEAL about how dirty bits should never get
lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop
the dirty bit for when it's not in the page tables.

In fact, if you just call that function twice, the first time it will MOVE
the dirty bits from the PTE to the "struct page *", and the _second_ time
it will just clear the dirty bit from the "struct page *". You end up with
a clean page. It returned the same return value BOTH TIMES, even though it
did two very different things (once just moving dirty bits around, and the
second time actually _removing_ the dirty bit entirely).

Again, I have a very simple claim: I claim that NONE of the
"test_clear_page_dirty()" functions make any sense what-so-ever. They're
all wrong.

The "funny" part is, that the only thing that Andrei reports actually
fixed his corruption (apart from the patch that just stops removing the
dirty bits from the PTE's _entirely_) is actually the part where he had an
"#if 0 .. #endif" around basically _all_ of the "test_clear_page_dirty()"
function (ie he had mis-understood what I asked for, and put it outside
the _outer_ if(), rather than putting it around the inner one).

So I claim:

 - there is ONE and only ONE place where you can really drop the dirty
   bits: it's when you're going to immediately afterwards do a writeout.
   This is the "clear_page_dirty_for_io()"

 - all the other "[test_and_]clear_dirty*()" functions seem to be outright
   buggy and bogus. Shuffling dirty bits around from the page tables to
   the "struct page *" (after having _cleared_ that "very important"
   PG_dirty bit just before - apparently it wasn't that important after
   all, was it?) is insane.

Nobody has actually ever explained why "test_clear_page_dirty()" is good
at all.

 - Why is it ever used instead of "clear_page_dirty_for_io()"?

 - What is the difference?

 - Why would you EVER want to clear bits just in the "struct page *" or
   just in the PTE's?

 - Why is it EVER correct to clear dirty bits except JUST BEFORE THE IO?

In other words, I have a theory:

 "A lot of this is actually historical cruft. Some of it may even be code
  that was never supposed to work, but because we maintained _other_ dirty
  bits in the PTE's, and never touched them before, we never even realized
  that the code that played with PG_dirty was totally insane"

Now, that's just a theory. And yeah, it may be stated a bit provocatively.
It may not be entirely correct. I'm just saying.. maybe it is?

And yes, we actually really _do_ have a data-point from Andrei that says
that if you just make "test_clear_page_dirty()" a no-op, the corruption
goes away. It was unintentional, but hey, it's a real datapoint.

See the email from Andrei:

	Subject: Re: 2.6.19 file content corruption on ext3
	From: Andrei Popa <andrei.popa@i-neo.ro>
	Date: Tue, 19 Dec 2006 01:48:11 +0200
	Message-Id: <1166485691.6977.6.camel@localhost>

and look at what remains of his "test_clear_page_dirty()". Scary, isn't
it? And a big hint that "test_clear_page_dirty()" is just totally BOGUS.

And the thing is, I think it's bogus just because I don't understand why
it would EVER be ok to drop those dirty bits _except_ very much just
before doing the IO that makes it non-dirty (where "truncate()" is really
a special case where the IO ends up being not done, but it's the same kind
of situation).

		Linus

^ permalink raw reply [flat|nested] 311+ messages in thread
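The "call that function twice" observation in the message above can be checked with a small user-space model. This is a hedged sketch, not the kernel function: `toy_state`, `toy_mkclean()` and `toy_clear_dirty_sync_ptes()` are hypothetical names, the page is assumed to have a single pte, and the mapping is assumed to account dirty pages so that the inner part always runs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of one page with one pte, not kernel code. */
struct toy_state {
	bool pg_dirty;	/* models PG_dirty in struct page */
	bool pte_dirty;	/* models the dirty bit in the pte */
};

/* Models page_mkclean(): clear the pte dirty bit, report if it was set. */
static bool toy_mkclean(struct toy_state *s)
{
	bool was_dirty = s->pte_dirty;

	s->pte_dirty = false;
	return was_dirty;
}

/*
 * Models the shape of test_clear_page_dirty_sync_ptes() from the patch,
 * assuming mapping_cap_account_dirty() is true.
 */
static int toy_clear_dirty_sync_ptes(struct toy_state *s)
{
	if (s->pg_dirty) {
		s->pg_dirty = false;		/* outer part: drops PG_dirty */
		if (toy_mkclean(s))
			s->pg_dirty = true;	/* inner part: set_page_dirty() */
		return 1;
	}
	return 0;
}

/* Two back-to-back calls on a page that is dirty in both places. */
static void toy_twice_demo(void)
{
	struct toy_state s = { .pg_dirty = true, .pte_dirty = true };

	/* First call: the dirty bit moves from the pte to the page. */
	assert(toy_clear_dirty_sync_ptes(&s) == 1);
	assert(s.pg_dirty && !s.pte_dirty);

	/* Second call: same return value, but now the page ends up clean. */
	assert(toy_clear_dirty_sync_ptes(&s) == 1);
	assert(!s.pg_dirty && !s.pte_dirty);
}
```

In this model both calls return 1, yet only the second one actually removes dirty state, which matches the point made in the message: the return value cannot distinguish "moved a dirty bit" from "dropped a dirty bit".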
* Re: 2.6.19 file content corruption on ext3
  2006-12-19 8:04 ` Linus Torvalds
@ 2006-12-19 9:00 ` Peter Zijlstra
  2006-12-19 9:05 ` Peter Zijlstra
  [not found] ` <4587B762.2030603@yahoo.com.au>
  1 sibling, 1 reply; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-19 9:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 00:04 -0800, Linus Torvalds wrote:

> Nobody has actually ever explained why "test_clear_page_dirty()" is good
> at all.
>
>  - Why is it ever used instead of "clear_page_dirty_for_io()"?
>
>  - What is the difference?
>
>  - Why would you EVER want to clear bits just in the "struct page *" or
>    just in the PTE's?
>
>  - Why is it EVER correct to clear dirty bits except JUST BEFORE THE IO?
>
> In other words, I have a theory:
>
>  "A lot of this is actually historical cruft. Some of it may even be code
>   that was never supposed to work, but because we maintained _other_ dirty
>   bits in the PTE's, and never touched them before, we never even realized
>   that the code that played with PG_dirty was totally insane"
>
> Now, that's just a theory. And yeah, it may be stated a bit provocatively.
> It may not be entirely correct. I'm just saying.. maybe it is?

On Sun, 2006-12-17 at 15:40 -0800, Andrew Morton wrote:

> try_to_free_buffers() clears the page's dirty state if it successfully removed
> the page's buffers.
>
> Background for this:
>
> - a process does a one-byte-write to a file on a 64k pagesize, 4k
>   blocksize ext3 filesystem. The page is now PageDirty, !PageUptodate and
>   has one dirty buffer and 15 not uptodate buffers.
>
> - kjournald writes the dirty buffer. The page is now PageDirty,
>   !PageUptodate and has a mix of clean and not uptodate buffers.
>
> - try_to_free_buffers() removes the page's buffers. It MUST now clear
>   PageDirty. If we were to leave the page dirty then we'd have a dirty, not
>   uptodate page with no buffer_heads.
>
>   We're screwed: we cannot write the page because we don't know which
>   sections of it contain garbage. We cannot read the page because we don't
>   know which sections of it contain modified data. We cannot free the page
>   because it is dirty.

However!! this is not true for mapped pages because mapped pages must
have the whole (16k in akpm's example) page loaded. Hence I suspect that
what Andrei did by accident - remove the if (mapping) case in
test_clear_page_dirty() - is actually totally correct.

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3
  2006-12-19 9:00 ` Peter Zijlstra
@ 2006-12-19 9:05 ` Peter Zijlstra
  0 siblings, 0 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-19 9:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 10:00 +0100, Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 00:04 -0800, Linus Torvalds wrote:
>
> > Nobody has actually ever explained why "test_clear_page_dirty()" is good
> > at all.
> >
> >  - Why is it ever used instead of "clear_page_dirty_for_io()"?
> >
> >  - What is the difference?
> >
> >  - Why would you EVER want to clear bits just in the "struct page *" or
> >    just in the PTE's?
> >
> >  - Why is it EVER correct to clear dirty bits except JUST BEFORE THE IO?
> >
> > In other words, I have a theory:
> >
> >  "A lot of this is actually historical cruft. Some of it may even be code
> >   that was never supposed to work, but because we maintained _other_ dirty
> >   bits in the PTE's, and never touched them before, we never even realized
> >   that the code that played with PG_dirty was totally insane"
> >
> > Now, that's just a theory. And yeah, it may be stated a bit provocatively.
> > It may not be entirely correct. I'm just saying.. maybe it is?
>
> On Sun, 2006-12-17 at 15:40 -0800, Andrew Morton wrote:
>
> > try_to_free_buffers() clears the page's dirty state if it successfully removed
> > the page's buffers.
> >
> > Background for this:
> >
> > - a process does a one-byte-write to a file on a 64k pagesize, 4k
> >   blocksize ext3 filesystem. The page is now PageDirty, !PageUptodate and
> >   has one dirty buffer and 15 not uptodate buffers.
> >
> > - kjournald writes the dirty buffer. The page is now PageDirty,
> >   !PageUptodate and has a mix of clean and not uptodate buffers.
> >
> > - try_to_free_buffers() removes the page's buffers. It MUST now clear
> >   PageDirty. If we were to leave the page dirty then we'd have a dirty, not
> >   uptodate page with no buffer_heads.
> >
> >   We're screwed: we cannot write the page because we don't know which
> >   sections of it contain garbage. We cannot read the page because we don't
> >   know which sections of it contain modified data. We cannot free the page
> >   because it is dirty.
>
> However!! this is not true for mapped pages because mapped pages must
> have the whole (16k in akpm's example) page loaded. Hence I suspect that
> what Andrei did by accident - remove the if (mapping) case in
> test_clear_page_dirty() - is actually totally correct.

Obviously I need my morning shot, 64k of course.

^ permalink raw reply [flat|nested] 311+ messages in thread
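The 64k-page, 4k-block scenario quoted above, and the remark that a mapped page is fully loaded, can be walked through with a small user-space model. This is a sketch under assumptions: `toy_bpage`, `toy_small_write()`, `toy_fault_in()` and `toy_must_clear_page_dirty()` are hypothetical names, and the last one only encodes the reasoning from the quoted text (no dirty buffers left, but some buffers never read in), not any real kernel predicate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum {
	TOY_PAGE_SIZE  = 64 * 1024,
	TOY_BLOCK_SIZE = 4 * 1024,
	TOY_NR_BUFFERS = TOY_PAGE_SIZE / TOY_BLOCK_SIZE,	/* 16 buffers */
};

/* Hypothetical stand-ins for buffer_head and page state, not kernel code. */
struct toy_buffer {
	bool uptodate;
	bool dirty;
};

struct toy_bpage {
	struct toy_buffer bh[TOY_NR_BUFFERS];
	bool page_dirty;
};

/* A small write(2): only the block covering the offset is touched. */
static void toy_small_write(struct toy_bpage *p, size_t offset)
{
	struct toy_buffer *bh;

	assert(offset < TOY_PAGE_SIZE);
	bh = &p->bh[offset / TOY_BLOCK_SIZE];
	bh->uptodate = true;
	bh->dirty = true;
	p->page_dirty = true;
}

/* A page fault on a mapping: the whole page is read in. */
static void toy_fault_in(struct toy_bpage *p)
{
	for (size_t i = 0; i < TOY_NR_BUFFERS; i++)
		p->bh[i].uptodate = true;
}

/* kjournald-style writeout of the buffers; PG_dirty is not touched. */
static void toy_write_buffers(struct toy_bpage *p)
{
	for (size_t i = 0; i < TOY_NR_BUFFERS; i++)
		p->bh[i].dirty = false;
}

static size_t toy_not_uptodate(const struct toy_bpage *p)
{
	size_t n = 0;

	for (size_t i = 0; i < TOY_NR_BUFFERS; i++)
		if (!p->bh[i].uptodate)
			n++;
	return n;
}

/*
 * The quoted reasoning: once all buffers are clean, a page that still has
 * not-uptodate blocks cannot sensibly stay dirty after its buffers go away.
 */
static bool toy_must_clear_page_dirty(const struct toy_bpage *p)
{
	for (size_t i = 0; i < TOY_NR_BUFFERS; i++)
		if (p->bh[i].dirty)
			return false;
	return toy_not_uptodate(p) > 0;
}

static void toy_buffers_demo(void)
{
	struct toy_bpage unmapped = { .page_dirty = false };
	struct toy_bpage mapped = { .page_dirty = false };

	/* One small write to an unmapped page: 1 dirty, 15 not uptodate. */
	toy_small_write(&unmapped, 5000);
	assert(toy_not_uptodate(&unmapped) == 15);
	toy_write_buffers(&unmapped);
	assert(unmapped.page_dirty && toy_must_clear_page_dirty(&unmapped));

	/* A mapped page was faulted in whole, so the concern does not apply. */
	toy_fault_in(&mapped);
	toy_small_write(&mapped, 5000);
	toy_write_buffers(&mapped);
	assert(toy_not_uptodate(&mapped) == 0);
	assert(!toy_must_clear_page_dirty(&mapped));
}
```

With 64k pages and 4k blocks the model has 16 buffers, which is where the "15 not uptodate buffers" in the quoted example comes from. For the fully faulted-in page the predicate is false, which is the distinction the reply draws for mapped pages.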
[parent not found: <4587B762.2030603@yahoo.com.au>]
* Re: 2.6.19 file content corruption on ext3
  [not found] ` <4587B762.2030603@yahoo.com.au>
@ 2006-12-19 10:32 ` Andrew Morton
  2006-12-19 10:42 ` Nick Piggin
  ` (3 more replies)
  2006-12-19 16:51 ` Linus Torvalds
  1 sibling, 4 replies; 311+ messages in thread
From: Andrew Morton @ 2006-12-19 10:32 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Peter Zijlstra, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Tue, 19 Dec 2006 20:56:50 +1100
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Linus Torvalds wrote:
>
> > NOTICE? First you make a BIG DEAL about how dirty bits should never get
> > lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop
> > the dirty bit for when it's not in the page tables.
>
> try_to_free_buffers is quite a special case, where we're transferring
> the page dirty metadata from the buffers to the page. I think Andrew
> would have a better grasp of it so he could correct me, but what it
> does is legitimate.

Well it used to be. After 2.6.19 it can do the wrong thing for mapped
pages. But it turns out that we don't feed it mapped pages, apart from
pagevec_strip() and possibly races against pagefaults.

> I think it could be very likely that indeed the bug is a latent one in
> a clear_page_dirty caller, rather than dirty-tracking itself.

The only callers are try_to_free_buffers(), truncate and a few scruffy
possibly-wrong-for-fsync filesystems which aren't being used here.


<spots a race in do_no_page()>

If a write-fault races with a read-fault and the write-fault loses, we forget
to mark the page dirty.

Something like this, but it's probably wrong - I didn't try very hard (am
feeling ill, and vaguely grumpy)


From: Andrew Morton <akpm@osdl.org>

Signed-off-by: Andrew Morton <akpm@osdl.org>
---

 mm/memory.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff -puN mm/memory.c~a mm/memory.c
--- a/mm/memory.c~a
+++ a/mm/memory.c
@@ -2264,10 +2264,22 @@ retry:
 		}
 	} else {
 		/* One of our sibling threads was faster, back out. */
+		if (write_access) {
+			/*
+			 * We might have raced against a read-fault. We still
+			 * need to dirty the page.
+			 */
+			dirty_page = vm_normal_page(vma, address, *page_table);
+			if (dirty_page) {
+				get_page(dirty_page);
+				goto dirty_it;
+			}
+		}
 		page_cache_release(new_page);
 		goto unlock;
 	}
 
+dirty_it:
 	/* no need to invalidate: a not-present page shouldn't be cached */
 	update_mmu_cache(vma, address, entry);
 	lazy_mmu_prot_update(entry);
_

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3
  2006-12-19 10:32 ` Andrew Morton
@ 2006-12-19 10:42 ` Nick Piggin
  2006-12-19 10:47 ` Andrew Morton
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 311+ messages in thread
From: Nick Piggin @ 2006-12-19 10:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

Andrew Morton wrote:
> On Tue, 19 Dec 2006 20:56:50 +1100
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>
>>Linus Torvalds wrote:
>>
>>
>>>NOTICE? First you make a BIG DEAL about how dirty bits should never get
>>>lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop
>>>the dirty bit for when it's not in the page tables.
>>
>>try_to_free_buffers is quite a special case, where we're transferring
>>the page dirty metadata from the buffers to the page. I think Andrew
>>would have a better grasp of it so he could correct me, but what it
>>does is legitimate.
>
>
> Well it used to be. After 2.6.19 it can do the wrong thing for mapped
> pages.

Yes, that is what I was trying to get at.

> But it turns out that we don't feed it mapped pages, apart from
> pagevec_strip() and possibly races against pagefaults.

True, and I think we have pretty well established that this isn't the
cause of Andrei's problem, but I think we all agree it is *a* bug?

And surely Andrei's data corruption will be of the same flavour in
that test_clear_page_dirty somewhere is now stripping pte dirty bits
where it shouldn't? (because it went away after Peter nooped that
behaviour)

>>I think it could be very likely that indeed the bug is a latent one in
>>a clear_page_dirty caller, rather than dirty-tracking itself.
>
>
> The only callers are try_to_free_buffers(), truncate and a few scruffy
> possibly-wrong-for-fsync filesystems which aren't being used here.
>
>
> <spots a race in do_no_page()>
>
> If a write-fault races with a read-fault and the write-fault loses, we forget
> to mark the page dirty.

Hmm.. in that case will the pte still be readonly, and thus the write
faulter will have to try again I think?

>
> Something like this, but it's probably wrong - I didn't try very hard (am
> feeling ill, and vaguely grumpy)
>
>
> From: Andrew Morton <akpm@osdl.org>
>
> Signed-off-by: Andrew Morton <akpm@osdl.org>
> ---
>
>  mm/memory.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff -puN mm/memory.c~a mm/memory.c
> --- a/mm/memory.c~a
> +++ a/mm/memory.c
> @@ -2264,10 +2264,22 @@ retry:
>  		}
>  	} else {
>  		/* One of our sibling threads was faster, back out. */
> +		if (write_access) {
> +			/*
> +			 * We might have raced against a read-fault. We still
> +			 * need to dirty the page.
> +			 */
> +			dirty_page = vm_normal_page(vma, address, *page_table);
> +			if (dirty_page) {
> +				get_page(dirty_page);
> +				goto dirty_it;
> +			}
> +		}
>  		page_cache_release(new_page);
>  		goto unlock;
>  	}
>
> +dirty_it:
>  	/* no need to invalidate: a not-present page shouldn't be cached */
>  	update_mmu_cache(vma, address, entry);
>  	lazy_mmu_prot_update(entry);
> _
>
>

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 10:32 ` Andrew Morton 2006-12-19 10:42 ` Nick Piggin @ 2006-12-19 10:47 ` Andrew Morton 2006-12-19 10:52 ` Peter Zijlstra 2006-12-19 10:55 ` Nick Piggin 3 siblings, 0 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-19 10:47 UTC (permalink / raw) To: Nick Piggin, Linus Torvalds, Peter Zijlstra, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 19 Dec 2006 02:32:55 -0800 Andrew Morton <akpm@osdl.org> wrote: > <spots a race in do_no_page()> > > If a write-fault races with a read-fault and the write-fault loses, we forget > to mark the page dirty. No that isn't right, is it. The writer just retakes the fault and all the right things happen. Ho hum. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 10:32 ` Andrew Morton 2006-12-19 10:42 ` Nick Piggin 2006-12-19 10:47 ` Andrew Morton @ 2006-12-19 10:52 ` Peter Zijlstra 2006-12-19 10:58 ` Nick Piggin 2006-12-19 10:55 ` Nick Piggin 3 siblings, 1 reply; 311+ messages in thread From: Peter Zijlstra @ 2006-12-19 10:52 UTC (permalink / raw) To: Andrew Morton Cc: Nick Piggin, Linus Torvalds, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote: > On Tue, 19 Dec 2006 20:56:50 +1100 > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > Linus Torvalds wrote: > > > > > NOTICE? First you make a BIG DEAL about how dirty bits should never get > > > lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop > > > the dirty bit for when it's not in the page tables. > > > > try_to_free_buffers is quite a special case, where we're transferring > > the page dirty metadata from the buffers to the page. I think Andrew > > would have a better grasp of it so he could correct me, but what it > > does is legitimate. > > Well it used to be. After 2.6.19 it can do the wrong thing for mapped > pages. But it turns out that we don't feed it mapped pages, apart from > pagevec_strip() and possibly races against pagefaults. So how about this: Index: linux-2.6-git/mm/page-writeback.c =================================================================== --- linux-2.6-git.orig/mm/page-writeback.c 2006-12-19 08:24:48.000000000 +0100 +++ linux-2.6-git/mm/page-writeback.c 2006-12-19 11:43:31.000000000 +0100 @@ -859,6 +859,9 @@ int test_clear_page_dirty(struct page *p struct address_space *mapping = page_mapping(page); unsigned long flags; + if (page_mapped(page)) + return 0; + if (!mapping) return TestClearPageDirty(page); ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 10:52 ` Peter Zijlstra @ 2006-12-19 10:58 ` Nick Piggin 2006-12-19 11:51 ` Peter Zijlstra 0 siblings, 1 reply; 311+ messages in thread From: Nick Piggin @ 2006-12-19 10:58 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrew Morton, Linus Torvalds, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr Peter Zijlstra wrote: > On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote: >>Well it used to be. After 2.6.19 it can do the wrong thing for mapped >>pages. But it turns out that we don't feed it mapped pages, apart from >>pagevec_strip() and possibly races against pagefaults. > > > So how about this: Well that's still racy. Anyway several earlier patches (including the one I posted) closed this race. Some were still reported to trigger corruption IIRC. > Index: linux-2.6-git/mm/page-writeback.c > =================================================================== > --- linux-2.6-git.orig/mm/page-writeback.c 2006-12-19 08:24:48.000000000 +0100 > +++ linux-2.6-git/mm/page-writeback.c 2006-12-19 11:43:31.000000000 +0100 > @@ -859,6 +859,9 @@ int test_clear_page_dirty(struct page *p > struct address_space *mapping = page_mapping(page); > unsigned long flags; > > + if (page_mapped(page)) > + return 0; > + > if (!mapping) > return TestClearPageDirty(page); > > > > - -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3
  2006-12-19 10:58 ` Nick Piggin
@ 2006-12-19 11:51 ` Peter Zijlstra
  0 siblings, 0 replies; 311+ messages in thread
From: Peter Zijlstra @ 2006-12-19 11:51 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Linus Torvalds, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr

On Tue, 2006-12-19 at 21:58 +1100, Nick Piggin wrote:
> Peter Zijlstra wrote:
> > On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote:
>
> >>Well it used to be. After 2.6.19 it can do the wrong thing for mapped
> >>pages. But it turns out that we don't feed it mapped pages, apart from
> >>pagevec_strip() and possibly races against pagefaults.
> >
> >
> > So how about this:
>
> Well that's still racy. Anyway several earlier patches (including
> the one I posted) closed this race. Some were still reported to
> trigger corruption IIRC.

I can't remember a patch that removes mapped pages from this code path,
however I could have missed it.

All out removing the mapping branch in ttfb() did also fix the problem -
which is a superset of page_mapped().

I'm now building a kernel with this patch, and will submit that to
rtorrent with mem=256M on a 1k ext3 filesystem on x86_64 smp preempt.

---
 fs/buffer.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2798,11 +2798,38 @@ static inline int buffer_busy(struct buf
 		(bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock)));
 }
 
+/*
+ * AKPM sayeth:
+ *
+ * - a process does a one-byte-write to a file on a 64k pagesize, 4k
+ * blocksize ext3 filesystem. The page is now PageDirty, !PageUptodate and
+ * has one dirty buffer and 15 not uptodate buffers.
+ *
+ * - kjournald writes the dirty buffer. The page is now PageDirty,
+ * !PageUptodate and has a mix of clean and not uptodate buffers.
+ *
+ * - try_to_free_buffers() removes the page's buffers. It MUST now clear
+ * PageDirty. If we were to leave the page dirty then we'd have a dirty, not
+ * uptodate page with no buffer_heads.
+ *
+ * We're screwed: we cannot write the page because we don't know which
+ * sections of it contain garbage. We cannot read the page because we don't
+ * know which sections of it contain modified data. We cannot free the page
+ * because it is dirty.
+ *
+ * However for mapped pages this is not true; mapped pages will be fully
+ * loaded and thus cannot have not uptodate buffers.
+ *
+ * Hence allow the PG_dirty bit to stay for pages that had no not uptodate
+ * buffers (and assert that mapped pages never have those).
+ */
+
 static int
 drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
 {
 	struct buffer_head *head = page_buffers(page);
 	struct buffer_head *bh;
+	int uptodate = 1;
 
 	bh = head;
 	do {
@@ -2818,11 +2845,14 @@ drop_buffers(struct page *page, struct b
 
 		if (!list_empty(&bh->b_assoc_buffers))
 			__remove_assoc_queue(bh);
+		if (!buffer_uptodate(bh))
+			uptodate = 0;
 		bh = next;
 	} while (bh != head);
 	*buffers_to_free = head;
 	__clear_page_buffers(page);
-	return 1;
+	VM_BUG_ON(page_mapped(page) && !uptodate);
+	return !uptodate;
 failed:
 	return 0;
 }

^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 10:32 ` Andrew Morton ` (2 preceding siblings ...) 2006-12-19 10:52 ` Peter Zijlstra @ 2006-12-19 10:55 ` Nick Piggin 3 siblings, 0 replies; 311+ messages in thread From: Nick Piggin @ 2006-12-19 10:55 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Peter Zijlstra, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr Andrew Morton wrote: > On Tue, 19 Dec 2006 20:56:50 +1100 > Nick Piggin <nickpiggin@yahoo.com.au> wrote: >>I think it could be very likely that indeed the bug is a latent one in >>a clear_page_dirty caller, rather than dirty-tracking itself. > > > The only callers are try_to_free_buffers(), truncate and a few scruffy > possibly-wrong-for-fsync filesystems which aren't being used here. Well truncate/invalidate will not operate on mapped pages (barring the very-unlikely truncate/invalidate vs fault races). We can ignore those filesystems as they don't include ext3. Which brings us back to try_to_free_buffers(). Maybe it is something else entirely, but did try_to_free_buffers ever get completely cleared? Or was some of Andrei's corruption possibly leftover on-disk corruption from a previous kernel? -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 [not found] ` <4587B762.2030603@yahoo.com.au> 2006-12-19 10:32 ` Andrew Morton @ 2006-12-19 16:51 ` Linus Torvalds 2006-12-19 17:43 ` Linus Torvalds 1 sibling, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-19 16:51 UTC (permalink / raw) To: Nick Piggin Cc: Peter Zijlstra, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 19 Dec 2006, Nick Piggin wrote: > > Counterexample? Well AFAIKS, the clearing of PG_dirty in ttfb() in > response to finding all buffers clean is perfectly valid. What makes > you think otherwise? If the page really is clean, then why the heck can't we just clean the page table bits too? Either it's clean or it isn't. If all the buffers being clean means that the page is clean, then it's clean. WE SHOULD NOT THINK THAT PTE'S ARE ANY DIFFERENT. I really don't see your point. Is it clean? If it is, then clear the damn dirty bits from the page tables too. Don't go pussyfooting around the issue and confuse yourself and everybody but me by saying "but if it's dirty in the page tables, it's magically dirty". NO. It really is that simple. Is it clean or not? If it's clean, you can remove ALL the dirty bits. Not just some. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 16:51 ` Linus Torvalds @ 2006-12-19 17:43 ` Linus Torvalds 2006-12-19 18:59 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-19 17:43 UTC (permalink / raw) To: Nick Piggin Cc: Peter Zijlstra, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr [-- Attachment #1: Type: TEXT/PLAIN, Size: 4156 bytes --] Btw, here's a totally new tangent on this: it's possible that user code is simply BUGGY. There is one case where the kernel actually forcibly writes zeroes into a file: when we're writing a page that straddles the "inode->i_size" boundary. See the various writepages in fs/buffer.c, they all contain variations on that theme (although most of them aren't as well commented as this snippet): /* * The page straddles i_size. It must be zeroed out on each and every * writepage invocation because it may be mmapped. "A file is mapped * in multiples of the page size. For a file that is not a multiple of * the page size, the remaining memory is zeroed when mapped, and * writes to that region are not written out to the file." */ kaddr = kmap_atomic(page, KM_USER0); memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset); flush_dcache_page(page); kunmap_atomic(kaddr, KM_USER0); Now, this should _matter_ only for user processes that are buggy, and that have written to the page _before_ extending it with ftruncate(). That's definitely a serious bug, but it's one that can go totally undetected depending on when the actual write-out happens. So what I'm saying is that if we end up writing things earlier thanks to the more aggressive dirty-page-management thing in 2.6.19, we might actually just expose a long-time userspace bug that was just a LOT harder to trigger before.. 
I'm not saying this is the cause of all this, but we've been tearing our hair out, and it might be worthwhile trying this really really really stupid patch that will notice when that happens at truncate() time, and tell the user that he's a total idiot. Or something to that effect. Maybe the reason this is so easy to trigger with rtorrent is not because rtorrent does some magic pattern that triggers a kernel bug, but simply because rtorrent itself might have a bug. Ok, so it's a long shot, but it's still worth testing, I suspect. The patch is very simple: whenever we do an _expanding_ truncate, we check the last page of the _old_ size, and if there were non-zero contents past the old size, we complain. Attached is a test-program that _should_ trigger a kernel message like a.out: BADNESS: truncate check 17000 for good measure, just so that you can verify that the patch works and actually catches this case. (The 17000 number is just the one-hundred _invalid_ 0xaa bytes - out of the 200 we wrote - that were summed up: 100*0xaa == 17000. Anything non-zero is always a bug). I doubt this is really it, but it's worth trying. If you fill out a page, and only do "ftruncate()" in response to SIGBUS messages (and don't truncate to whole pages), you could potentially see zeroes at the end of the page exactly because _writeout_ cleared the page for you! So it _could_ explain the symptoms, but only if user-space was horribly horribly broken. 
Linus ---- diff --git a/mm/memory.c b/mm/memory.c index c00bac6..79cecab 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping, } EXPORT_SYMBOL(unmap_mapping_range); +static void check_last_page(struct address_space *mapping, loff_t size) +{ + pgoff_t index; + unsigned int offset; + struct page *page; + + if (!mapping) + return; + offset = size & ~PAGE_MASK; + if (!offset) + return; + index = size >> PAGE_SHIFT; + page = find_lock_page(mapping, index); + if (page) { + unsigned int check = 0; + unsigned char *kaddr = kmap_atomic(page, KM_USER0); + do { + check += kaddr[offset++]; + } while (offset < PAGE_SIZE); + kunmap_atomic(kaddr,KM_USER0); + unlock_page(page); + page_cache_release(page); + if (check) + printk("%s: BADNESS: truncate check %u\n", current->comm, check); + } +} + /** * vmtruncate - unmap mappings "freed" by truncate() syscall * @inode: inode of the file used @@ -1875,6 +1902,7 @@ do_expand: goto out_sig; if (offset > inode->i_sb->s_maxbytes) goto out_big; + check_last_page(mapping, inode->i_size); i_size_write(inode, offset); out_truncate: [-- Attachment #2: Type: TEXT/PLAIN, Size: 566 bytes --] #include <sys/mman.h> #include <sys/fcntl.h> #include <unistd.h> #include <string.h> int main(int argc, char **argv) { char *mapping; int fd; fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666); if (fd < 0) return -1; if (ftruncate(fd, 10) < 0) return -1; mapping = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); if (-1 == (int)(long)mapping) return -1; memset(mapping, 0x55, 10); if (ftruncate(fd, 100) < 0) return -1; memset(mapping, 0xaa, 200); if (ftruncate(fd, 200) < 0) return -1; return 0; } ^ permalink raw reply related [flat|nested] 311+ messages in thread
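The arithmetic behind the expected "truncate check 17000" message can be replayed in user space. What follows is only a sketch under stated assumptions: `tail_check()`, `replay_test_program()` and `MODEL_PAGE_SIZE` are hypothetical names invented for this illustration, the page is modelled as a plain 4096-byte buffer, and nothing here touches the page cache. It mirrors the summing loop of `check_last_page()` from the patch above and the writes made by the attached test program.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size for this sketch */

/*
 * Sum every byte from `offset` to the end of the modelled page.
 * A non-zero result means something was stored past `offset`,
 * which is what the kernel check complains about.
 */
static unsigned int tail_check(const unsigned char *page, size_t offset)
{
    unsigned int check = 0;

    while (offset < MODEL_PAGE_SIZE)
        check += page[offset++];
    return check;
}

/*
 * Replay the stores of the attached test program on a plain buffer and
 * return the sum that would be seen when the file grows from 100 to 200.
 */
static unsigned int replay_test_program(void)
{
    static unsigned char page[MODEL_PAGE_SIZE];

    memset(page, 0, sizeof page);
    memset(page, 0x55, 10);     /* file size is 10: these bytes are valid */
    memset(page, 0xaa, 200);    /* file size is 100: bytes 100..199 are not */
    return tail_check(page, 100);
}
```

Calling `replay_test_program()` sums the one hundred 0xaa bytes that sit beyond the old size of 100, that is 100 * 170, while a page that is all zero past the old size sums to 0.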
* Re: 2.6.19 file content corruption on ext3 2006-12-19 17:43 ` Linus Torvalds @ 2006-12-19 18:59 ` Linus Torvalds 2006-12-19 21:30 ` Peter Zijlstra 2006-12-20 5:56 ` Jari Sundell 2006-12-19 21:56 ` Florian Weimer 2006-12-21 13:03 ` Peter Zijlstra 2 siblings, 2 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-19 18:59 UTC (permalink / raw) To: Nick Piggin Cc: Peter Zijlstra, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 19 Dec 2006, Linus Torvalds wrote: > > here's a totally new tangent on this: it's possible that user code is > simply BUGGY. Btw, here's a simpler test-program that actually shows the difference between 2.6.18 and 2.6.19 in action, and why it could explain why a program like rtorrent might show corruption behaviour that it didn't show before. #include <sys/mman.h> #include <sys/fcntl.h> #include <unistd.h> #include <string.h> int main(int argc, char **argv) { char *mapping; int fd; fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666); if (fd < 0) return -1; if (ftruncate(fd, 10) < 0) return -1; mapping = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); if (-1 == (int)(long)mapping) return -1; memset(mapping, 0xaa, 20); sync(); if (ftruncate(fd, 40) < 0) return -1; memset(mapping + 20, 0x55, 20); write(1, mapping, 40); return 0; } Notice the "sync()" in between the "memset()" and the "ftruncate()". In 2.6.18, that would normally do absolutely _nothing_ to the shared memory mapping, because we simply couldn't track pages that were dirty in the page tables. So in 2.6.18, if you try this, with ./a.out | od -x you should see something like 0000000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa 0000020 aaaa aaaa 5555 5555 5555 5555 5555 5555 0000040 5555 5555 5555 5555 0000050 which matches your memset() patterns: 20 bytes of 0xaa, and 20 bytes of 0x55. HOWEVER. 
In 2.6.19, because we actually track dirty data so much better, "sync()" will actually be smart enough to write out the dirty mmap'ed data too. But since the user program has only allocated ten bytes for it in the file, when it is written out, the rest of the page is cleared. When you then write the last 20 bytes (after _properly_ allocating memory for them), you should now see a pattern like 0000000 aaaa aaaa aaaa aaaa aaaa 0000 0000 0000 0000020 0000 0000 5555 5555 5555 5555 5555 5555 0000040 5555 5555 5555 5555 0000050 instead: with ten bytes of zero in between, because the data that couldn't be written out was cleared. So 2.6.19 is strictly _better_, but exactly because it's tracking dirty status much more precisely, you'll see certain user-level bugs much more easily. NOTE NOTE NOTE! The code really _was_ buggy in 2.6.18 too, and you _can_ get the zeroes in the middle of the file with an older kernel. But in older kernels, you need to be really really unlucky, and have the page cleaned by strong memory pressure. In 2.6.19, any "sync()" activity (including from the outside) will clean the page, so a user program with this bug can just be made to trigger the bug much more easily. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
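For contrast, here is a minimal sketch of the same store pattern with the ordering the message above calls correct: the file is extended to its full 40 bytes before any byte is stored through the mapping, so no store ever lands past i_size. The helper name `write_after_extending()` is hypothetical, and `msync(MS_SYNC)` is swapped in for the global `sync()` so that the forced writeout stays local to this mapping. It is an illustration of the rule, not code from the thread.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Store 20 bytes of 0xaa and 20 bytes of 0x55 through a shared mapping,
 * extending the file BEFORE touching the bytes, then read the file back
 * into out[40]. Returns 0 on success, -1 on any error.
 */
static int write_after_extending(const char *path, unsigned char out[40])
{
    int rc = -1;
    int fd = open(path, O_RDWR | O_TRUNC | O_CREAT, 0666);

    if (fd < 0)
        return -1;

    /* Reserve all 40 bytes up front, so every later store is inside i_size. */
    if (ftruncate(fd, 40) == 0) {
        char *mapping = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);

        if (mapping != MAP_FAILED) {
            memset(mapping, 0xaa, 20);
            /* Force writeout at the point where the original calls sync(). */
            if (msync(mapping, 4096, MS_SYNC) == 0) {
                memset(mapping + 20, 0x55, 20);
                /* Read through the file descriptor, not the mapping. */
                if (pread(fd, out, 40, 0) == 40)
                    rc = 0;
            }
            munmap(mapping, 4096);
        }
    }
    close(fd);
    unlink(path);   /* scratch file, removed again */
    return rc;
}
```

With this ordering the writeout in the middle has nothing past i_size to zero, so the bytes read back should be the two memset() patterns with no run of zeroes between them, whenever the writeout happens.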
* Re: 2.6.19 file content corruption on ext3 2006-12-19 18:59 ` Linus Torvalds @ 2006-12-19 21:30 ` Peter Zijlstra 2006-12-19 22:51 ` Linus Torvalds 2006-12-20 18:02 ` Stephen Clark 2006-12-20 5:56 ` Jari Sundell 1 sibling, 2 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-19 21:30 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote: > > On Tue, 19 Dec 2006, Linus Torvalds wrote: > > > > here's a totally new tangent on this: it's possible that user code is > > simply BUGGY. I'm sad to say this doesn't trigger :-( ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 21:30 ` Peter Zijlstra @ 2006-12-19 22:51 ` Linus Torvalds 2006-12-19 22:58 ` Andrew Morton 2006-12-20 18:02 ` Stephen Clark 1 sibling, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-19 22:51 UTC (permalink / raw) To: Peter Zijlstra Cc: Nick Piggin, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 19 Dec 2006, Peter Zijlstra wrote: > On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote: > > > > On Tue, 19 Dec 2006, Linus Torvalds wrote: > > > > > > here's a totally new tangent on this: it's possible that user code is > > > simply BUGGY. > > I'm sad to say this doesn't trigger :-( Oh, well. It was a theory. Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 22:51 ` Linus Torvalds @ 2006-12-19 22:58 ` Andrew Morton 2006-12-19 23:06 ` Peter Zijlstra 0 siblings, 1 reply; 311+ messages in thread From: Andrew Morton @ 2006-12-19 22:58 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 19 Dec 2006 14:51:55 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > > > On Tue, 19 Dec 2006, Peter Zijlstra wrote: > > > On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote: > > > > > > On Tue, 19 Dec 2006, Linus Torvalds wrote: > > > > > > > > here's a totally new tangent on this: it's possible that user code is > > > > simply BUGGY. > > > > I'm sad to say this doesn't trigger :-( > > Oh, well. It was a theory. > Well... we'd need to see (corruption && this-not-triggering) to be sure. Peter, have you been able to trigger the corruption? ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 22:58 ` Andrew Morton @ 2006-12-19 23:06 ` Peter Zijlstra 2006-12-19 23:07 ` Peter Zijlstra 2006-12-20 0:03 ` Linus Torvalds 0 siblings, 2 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-19 23:06 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote: > Well... we'd need to see (corruption && this-not-triggering) to be sure. > > Peter, have you been able to trigger the corruption? Yes; however the mail I sent describing that seems to be lost in space. /me quotes from the sent folder: > The bad news is, that doesn't help either. The good news is I can > reproduce it. > > What I did to achieve that: > > - get a sizable torrent from legaltorrents.com / or create a torrent > yourself that is around 600M and has multiple files. > > - start a tracker, and multiple seeds (I used three machines here) > > - pull the torrent on a fourth machine > > the seeding machines don't much matter of course. > > the fourth machine was a dual core x86-64 with an SMP kernel and > PREEMPT, mem=256M (so that the torrent is quite a bit larger and does > require writeout) and I used an ext3 partition with 1k blocks. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 23:06 ` Peter Zijlstra @ 2006-12-19 23:07 ` Peter Zijlstra 2006-12-20 0:03 ` Linus Torvalds 1 sibling, 0 replies; 311+ messages in thread From: Peter Zijlstra @ 2006-12-19 23:07 UTC (permalink / raw) To: Andrew Morton Cc: Linus Torvalds, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Wed, 2006-12-20 at 00:06 +0100, Peter Zijlstra wrote: > On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote: > > > Well... we'd need to see (corruption && this-not-triggering) to be sure. > > > > Peter, have you been able to trigger the corruption? > > Yes; however the mail I sent describing that seems to be lost in space. > > /me quotes from the sent folder: > > > The bad news is, that doesn't help either. The good news is I can > > reproduce it. > > > > What I did to achieve that: > > > > - get a sizable torrent from legaltorrents.com / or create a torrent > > yourself that is around 600M and has multiple files. > > > > - start a tracker, and multiple seeds (I used three machines here) > > > > - pull the torrent on a fourth machine > > > > the seeding machines don't much matter of course. > > > > the fourth machine was a dual core x86-64 with an SMP kernel and > > PREEMPT, mem=256M (so that the torrent is quite a bit larger and does > > require writeout) and I used an ext3 partition with 1k blocks. PS. this was a reply to: http://lkml.org/lkml/2006/12/19/121 ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 23:06 ` Peter Zijlstra 2006-12-19 23:07 ` Peter Zijlstra @ 2006-12-20 0:03 ` Linus Torvalds 2006-12-20 0:18 ` Andrew Morton 1 sibling, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-20 0:03 UTC (permalink / raw) To: Peter Zijlstra Cc: Andrew Morton, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Wed, 20 Dec 2006, Peter Zijlstra wrote: > On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote: > > > Well... we'd need to see (corruption && this-not-triggering) to be sure. > > > > Peter, have you been able to trigger the corruption? > > Yes; however the mail I send describing that seems to be lost in space. Btw, can somebody actually explain the mess that is ext3 "dirtying". Ext3 does NOT use __set_page_dirty_buffers. It does static int ext3_journalled_set_page_dirty(struct page *page) { SetPageChecked(page); return __set_page_dirty_nobuffers(page); } and uses that "Checked" bit as a "whole page is dirty" bit (which it tests in "writepage()". You realize what this all means? It means that ANYTHING that actually clears the _real_ dirty bit won't actually be doing anything at all for ext3, because the Checked bit will still stay set, and any IO down the line on that page would totally ignore the dirty bits on the buffer heads and just write out everything. That is "The Mess(tm)". It also basically means that anything that clears the dirty bit without just calling "writepage()" had _better_ call "invalidatepage()" for the whole page, because otherwise the PageChecked bit will never be cleared as far as I can see. Happily, at least ext3 seems to _test_ for that case in the release_page() function, so it appears that we do do this. But this seems to just strengthen my argument: you can NEVER clean a page, unless you (a) do IO on it immediately afterwards (writeback) or (b) invalidate it entirely (truncate). 
I'd really like to see just those two functions exist. Preferably in a form where you can see easily that we actually follow those rules. Rather than having a confusing set of "clear_page_dirty()" and "test_and_clear_page_dirty()" functions that are called from random places. IOW, I think the "clear_page_dirty_for_io()" is fine (it's case (a)) above, and then we should probably have a "cancel_dirty_page()" function that does all the current clear_page_dirty() but also makes sure that we actually call the invalidate_page() function itself. Hmm? Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-20 0:03 ` Linus Torvalds @ 2006-12-20 0:18 ` Andrew Morton 0 siblings, 0 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-20 0:18 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 19 Dec 2006 16:03:49 -0800 (PST) Linus Torvalds <torvalds@osdl.org> wrote: > > > On Wed, 20 Dec 2006, Peter Zijlstra wrote: > > > On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote: > > > > > Well... we'd need to see (corruption && this-not-triggering) to be sure. > > > > > > Peter, have you been able to trigger the corruption? > > > > Yes; however the mail I send describing that seems to be lost in space. > > Btw, can somebody actually explain the mess that is ext3 "dirtying". > > Ext3 does NOT use __set_page_dirty_buffers. It does > > static int ext3_journalled_set_page_dirty(struct page *page) > { > SetPageChecked(page); > return __set_page_dirty_nobuffers(page); > } > > and uses that "Checked" bit as a "whole page is dirty" bit (which it tests > in "writepage()". This is purely for data=journal, which is rarely used. In journalled-data mode, write(), write-fault, etc are not allowed to dirty the pages and buffers, because the data has to be written to the journal first. After the data has been written to the journal we only then mark buffers (and hence pages) dirty as far as the VFS is concerned. For checkpointing the data back to its real place on the disk. For MAP_SHARED pages ext3 cheats madly and doesn't journal the data at all. In all journalling modes, MAP_SHARED data follows the regular ext2-style handling. Which is a bit of a wart. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 21:30 ` Peter Zijlstra 2006-12-19 22:51 ` Linus Torvalds @ 2006-12-20 18:02 ` Stephen Clark 1 sibling, 0 replies; 311+ messages in thread From: Stephen Clark @ 2006-12-20 18:02 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Nick Piggin, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr Peter Zijlstra wrote: > On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote: > > On Tue, 19 Dec 2006, Linus Torvalds wrote: > > > > > > here's a totally new tangent on this: it's possible that user code is > > > simply BUGGY. > > I'm sad to say this doesn't trigger :-( Hi all, I ran it a number of times on 2.6.16-1.2115_FC4 and always got ./a.out | od -x 0000000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa 0000020 aaaa aaaa 5555 5555 5555 5555 5555 5555 0000040 5555 5555 5555 5555 but running it on 2.6.19-rc5 I always get zeros in the middle. Steve -- "They that give up essential liberty to obtain temporary safety, deserve neither liberty nor safety." (Ben Franklin) "The course of history shows that as a government grows, liberty decreases." (Thomas Jefferson) ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 18:59 ` Linus Torvalds 2006-12-19 21:30 ` Peter Zijlstra @ 2006-12-20 5:56 ` Jari Sundell 1 sibling, 0 replies; 311+ messages in thread From: Jari Sundell @ 2006-12-20 5:56 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Peter Zijlstra, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On 12/20/06, Linus Torvalds <torvalds@osdl.org> wrote: > On Tue, 19 Dec 2006, Linus Torvalds wrote: > > > > here's a totally new tangent on this: it's possible that user code is > > simply BUGGY. > > Btw, here's a simpler test-program that actually shows the difference > between 2.6.18 and 2.6.19 in action, and why it could explain why a > program like rtorrent might show corruption behaviour that it didn't show > before. Kinda late to the discussion, but I guess I could summarize what rtorrent actually does, or should be doing. When downloading a new torrent, it will create the files and truncate them to the final size. It will never call truncate after this and the files will remain sparse until data is downloaded. A 'piece' is mapped to memory using MAP_SHARED, which will be page aligned on single file torrents but unlikely to be so on multi-file torrents. So on multi-file torrents it'll often end up with two mappings overlapping with one page, each of which only write to their own part of the page. These will then be sync'ed with MS_ASYNC, or MS_SYNC if low on disk space. After that it might be unmapped, then mapped as read-only. I haven't thought of asking if single file torrents are ok. Rakshasa ^ permalink raw reply [flat|nested] 311+ messages in thread
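The access pattern described above can be sketched roughly as follows. This is not rtorrent's code: `create_sized_file()` and `write_piece()` are hypothetical helpers written only to illustrate the description, assuming a file that is truncated once to its final size and a piece that starts at an arbitrary, possibly unaligned, file offset. The mapping is started at the enclosing page boundary because mmap() offsets must be page aligned, which is also why two neighbouring pieces can end up sharing a page.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create the file and give it its final size once; it stays sparse. */
static int create_sized_file(const char *path, off_t final_size)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, final_size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Write one piece through a MAP_SHARED mapping. Only the bytes that
 * belong to the piece are stored; the rest of the first and last page
 * is left alone. Returns 0 on success, -1 on error.
 */
static int write_piece(int fd, off_t offset, const void *data, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    off_t map_start = offset - (offset % page);   /* round down to a page */
    size_t lead = (size_t)(offset - map_start);   /* bytes before the piece */
    size_t map_len = lead + len;
    unsigned char *base;
    int rc;

    base = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED,
                fd, map_start);
    if (base == MAP_FAILED)
        return -1;

    memcpy(base + lead, data, len);
    rc = msync(base, map_len, MS_ASYNC);   /* MS_SYNC when low on disk space */
    if (munmap(base, map_len) < 0)
        rc = -1;
    return rc;
}
```

Because the file already has its final size, every store made by `write_piece()` is inside i_size, so the zeroing of the tail of the last page discussed earlier in the thread should not come into play for this pattern.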
* Re: 2.6.19 file content corruption on ext3 2006-12-19 17:43 ` Linus Torvalds 2006-12-19 18:59 ` Linus Torvalds @ 2006-12-19 21:56 ` Florian Weimer 2006-12-21 13:03 ` Peter Zijlstra 2 siblings, 0 replies; 311+ messages in thread From: Florian Weimer @ 2006-12-19 21:56 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Peter Zijlstra, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Marc Haber, Martin Michlmayr * Linus Torvalds: > Now, this should _matter_ only for user processes that are buggy, > and that have written to the page _before_ extending it with > ftruncate(). APT seems to properly extend the file before mapping it, by writing a zero byte at the desired position (creating a hole). 24986 open("/var/cache/apt/pkgcache.bin", O_RDWR|O_CREAT|O_TRUNC, 0666) = 6 24986 lseek(6, 12582911, SEEK_SET) = 12582911 24986 write(6, "\0", 1) = 1 24986 mmap(NULL, 12582912, PROT_READ|PROT_WRITE, MAP_SHARED, 6, 0) = 0x2b6578636000 24986 msync(0x2b6578636000, 7464112, MS_SYNC) = 0 24986 msync(0x2b6578636000, 8656, MS_SYNC) = 0 24986 munmap(0x2b6578636000, 12582912) = 0 24986 ftruncate(6, 7464112) = 0 24986 fstat(6, {st_mode=S_IFREG|0644, st_size=7464112, ...}) = 0 24986 mmap(NULL, 7464112, PROT_READ, MAP_SHARED, 6, 0) = 0x2b6578636000 APT's code is pretty convoluted, though, and there might be some code path in it that gets it wrong. 8-P ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 17:43 ` Linus Torvalds 2006-12-19 18:59 ` Linus Torvalds 2006-12-19 21:56 ` Florian Weimer @ 2006-12-21 13:03 ` Peter Zijlstra 2006-12-21 20:40 ` Andrew Morton 2 siblings, 1 reply; 311+ messages in thread From: Peter Zijlstra @ 2006-12-21 13:03 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 2006-12-19 at 09:43 -0800, Linus Torvalds wrote: > > Btw, > here's a totally new tangent on this: it's possible that user code is > simply BUGGY. depmod: BADNESS: written outside isize 22183 --- diff --git a/fs/buffer.c b/fs/buffer.c index d1f1b54..5db9fd9 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2393,6 +2393,17 @@ int nobh_commit_write(struct file *file, struct page *page, } EXPORT_SYMBOL(nobh_commit_write); +static void __check_tail_zero(char *kaddr, unsigned int offset) +{ + unsigned int check = 0; + do { + check += kaddr[offset++]; + } while (offset < PAGE_CACHE_SIZE); + if (check) + printk(KERN_ERR "%s: BADNESS: written outside isize %u\n", + current->comm, check); +} + /* * nobh_writepage() - based on block_full_write_page() except * that it tries to operate without attaching bufferheads to @@ -2437,6 +2448,7 @@ int nobh_writepage(struct page *page, get_block_t *get_block, * writes to that region are not written out to the file." */ kaddr = kmap_atomic(page, KM_USER0); + __check_tail_zero(kaddr, offset); memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset); flush_dcache_page(page); kunmap_atomic(kaddr, KM_USER0); @@ -2604,6 +2616,7 @@ int block_write_full_page(struct page *page, get_block_t *get_block, * writes to that region are not written out to the file." 
*/ kaddr = kmap_atomic(page, KM_USER0); + __check_tail_zero(kaddr, offset); memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset); flush_dcache_page(page); kunmap_atomic(kaddr, KM_USER0); ^ permalink raw reply related [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-21 13:03 ` Peter Zijlstra @ 2006-12-21 20:40 ` Andrew Morton 0 siblings, 0 replies; 311+ messages in thread From: Andrew Morton @ 2006-12-21 20:40 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Nick Piggin, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Thu, 21 Dec 2006 14:03:20 +0100 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote: > On Tue, 2006-12-19 at 09:43 -0800, Linus Torvalds wrote: > > > > Btw, > > here's a totally new tangent on this: it's possible that user code is > > simply BUGGY. > > depmod: BADNESS: written outside isize 22183 akpm:/usr/src/module-init-tools-3.3-pre1> grep -r mmap . ./zlibsupport.c: map = mmap(0, *size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0); So presumably it's in a library. akpm:/usr/src/25> ldd /sbin/depmod linux-gate.so.1 => (0xffffe000) libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0x46afa000) /lib/ld-linux.so.2 (0x4631d000) worrisome. ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 6:34 ` Linus Torvalds 2006-12-19 6:51 ` Nick Piggin @ 2006-12-19 20:03 ` dean gaudet 1 sibling, 0 replies; 311+ messages in thread From: dean gaudet @ 2006-12-19 20:03 UTC (permalink / raw) To: Linus Torvalds Cc: Nick Piggin, Peter Zijlstra, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Mon, 18 Dec 2006, Linus Torvalds wrote: > On Tue, 19 Dec 2006, Nick Piggin wrote: > > > > We never want to drop dirty data! (ignoring the truncate case, which is > > handled privately by truncate anyway) > > Bzzt. > > SURE we do. > > We absolutely do want to drop dirty data in the writeout path. > > How do you think dirty data ever _becomes_ clean data? > > In other words, yes, we _do_ want to test-and-clear all the pgtable bits > _and_ the PG_dirty bit. We want to do it for: > - writeout > - truncate > - possibly a "drop" event (which could be a case for a journal entry that > becomes stale due to being replaced or something - kind of "truncate" > on metadata) > > because both of those events _literally_ turn dirty state into clean > state. > > In no other circumstance do we ever want to clear a dirty bit, as far as I > can tell. i admit this may not be entirely relevant, but it seems like a good place to bring up an old problem: when a disk dies with lots of queued writes it can totally bring a system to its knees... even after the disk is removed. i wrote up something about this a while ago: http://lkml.org/lkml/2005/8/18/243 so there's another reason to "clear a dirty bit"... well, in fact -- drop the pages entirely. -dean ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 4:36 ` Nick Piggin 2006-12-19 6:34 ` Linus Torvalds @ 2006-12-19 7:22 ` Peter Zijlstra 2006-12-19 7:59 ` Nick Piggin 1 sibling, 1 reply; 311+ messages in thread From: Peter Zijlstra @ 2006-12-19 7:22 UTC (permalink / raw) To: Nick Piggin Cc: Linus Torvalds, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 2006-12-19 at 15:36 +1100, Nick Piggin wrote: > plain text document attachment (fs-fix.patch) > Index: linux-2.6/fs/buffer.c > =================================================================== > --- linux-2.6.orig/fs/buffer.c 2006-12-19 15:15:46.000000000 +1100 > +++ linux-2.6/fs/buffer.c 2006-12-19 15:36:01.000000000 +1100 > @@ -2852,7 +2852,17 @@ int try_to_free_buffers(struct page *pag > * This only applies in the rare case where try_to_free_buffers > * succeeds but the page is not freed. > */ > - clear_page_dirty(page); > + > + /* > + * If the page has been dirtied via the user mappings, then > + * clean buffers does not indicate the page data is actually > + * clean! Only clear the page dirty bit if there are no dirty > + * ptes either. > + * > + * If there are dirty ptes, then the page must be uptodate, so > + * the above concern does not apply. 
> + */ > + clear_page_dirty_sync_ptes(page); > } > out: > if (buffers_to_free) { > Index: linux-2.6/include/linux/page-flags.h > =================================================================== > --- linux-2.6.orig/include/linux/page-flags.h 2006-12-19 15:17:18.000000000 +1100 > +++ linux-2.6/include/linux/page-flags.h 2006-12-19 15:34:24.000000000 +1100 > @@ -254,6 +254,7 @@ static inline void SetPageUptodate(struc > struct page; /* forward declaration */ > > int test_clear_page_dirty(struct page *page); > +int test_clear_page_dirty_sync_ptes(struct page *page); > int test_clear_page_writeback(struct page *page); > int test_set_page_writeback(struct page *page); > > @@ -262,6 +263,11 @@ static inline void clear_page_dirty(stru > test_clear_page_dirty(page); > } > > +static inline void clear_page_dirty_sync_ptes(struct page *page) > +{ > + test_clear_page_dirty_sync_ptes(page); > +} > + > static inline void set_page_writeback(struct page *page) > { > test_set_page_writeback(page); > Index: linux-2.6/mm/page-writeback.c > =================================================================== > --- linux-2.6.orig/mm/page-writeback.c 2006-12-19 15:17:53.000000000 +1100 > +++ linux-2.6/mm/page-writeback.c 2006-12-19 15:33:29.000000000 +1100 > @@ -844,9 +844,10 @@ EXPORT_SYMBOL(set_page_dirty_lock); > > /* > * Clear a page's dirty flag, while caring for dirty memory accounting. > + * Does not clear pte dirty bits. > * Returns true if the page was previously dirty. 
> */ > -int test_clear_page_dirty(struct page *page) > +static int test_clear_page_dirty_leave_ptes(struct page *page) > { > struct address_space *mapping = page_mapping(page); > unsigned long flags; > @@ -862,10 +863,8 @@ int test_clear_page_dirty(struct page *p > * We can continue to use `mapping' here because the > * page is locked, which pins the address_space > */ > - if (mapping_cap_account_dirty(mapping)) { > - page_mkclean(page); > + if (mapping_cap_account_dirty(mapping)) > dec_zone_page_state(page, NR_FILE_DIRTY); > - } > return 1; > } > write_unlock_irqrestore(&mapping->tree_lock, flags); > @@ -873,9 +872,43 @@ int test_clear_page_dirty(struct page *p > } > return TestClearPageDirty(page); > } > + > +/* > + * As above, but does clear dirty bits from ptes > + */ > +int test_clear_page_dirty(struct page *page) > +{ > + struct address_space *mapping = page_mapping(page); > + > + if (test_clear_page_dirty_leave_ptes(page)) { > + if (mapping_cap_account_dirty(mapping)) > + page_mkclean(page); > + return 1; > + } > + return 0; > +} > EXPORT_SYMBOL(test_clear_page_dirty); > > /* > + * As above, but redirties page if any dirty ptes are found (and then only > + * if the mapping accounts dirty pages, otherwise dirty ptes are left dirty > + * but the page is cleaned). > + */ > +int test_clear_page_dirty_sync_ptes(struct page *page) > +{ > + struct address_space *mapping = page_mapping(page); > + > + if (test_clear_page_dirty_leave_ptes(page)) { > + if (mapping_cap_account_dirty(mapping)) { > + if (page_mkclean(page)) > + set_page_dirty(page); > + } > + return 1; > + } > + return 0; > +} > + > +/* > * Clear a page's dirty flag, while caring for dirty memory accounting. > * Returns true if the page was previously dirty. > * Hmm, not quite; It certainly look better than the extra ,[01] tagged to test_clear_page_dirty() though. 
Although I would have expected it the other way around - test_clear_page_dirty_sync_ptes to be the default case and test_clear_page_dirty_clean_ptes to be used in clear_page_dirty_for_io(). Anyway it has the same issues as the others. See what happens when you run two test_clear_page_dirty_sync_ptes() consecutively, you still lose PG_dirty even though the page might actually be dirty. ^ permalink raw reply [flat|nested] 311+ messages in thread
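The sequence described above can be sketched as a toy user-space model. This is not kernel code: struct toy_page and the toy_* helpers are hypothetical stand-ins for struct page, page_mkclean() and test_clear_page_dirty_sync_ptes() as they appear in the quoted patch. The sketch assumes a mapping that accounts dirty pages and assumes that no writeback happens between the two calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; NOT the kernel structure. */
struct toy_page {
    bool pg_dirty;      /* models PG_dirty */
    bool pte_dirty;     /* models "some pte mapping this page is dirty" */
    bool data_written;  /* false while the page holds unwritten data */
};

/* Models page_mkclean(): clean the ptes, report whether any was dirty. */
bool toy_page_mkclean(struct toy_page *p)
{
    bool was_dirty = p->pte_dirty;

    p->pte_dirty = false;
    return was_dirty;
}

/*
 * Models test_clear_page_dirty_sync_ptes() from the quoted patch,
 * for a mapping where mapping_cap_account_dirty() is true.
 */
bool toy_clear_dirty_sync_ptes(struct toy_page *p)
{
    assert(p != NULL);
    if (!p->pg_dirty)
        return false;
    p->pg_dirty = false;        /* test_clear_page_dirty_leave_ptes() */
    if (toy_page_mkclean(p))
        p->pg_dirty = true;     /* set_page_dirty() */
    return true;
}

/*
 * Runs the helper twice with no writeout in between. Returns true if
 * the page ends up holding unwritten data with neither PG_dirty nor a
 * dirty pte left to record that fact.
 */
bool toy_double_clear_loses_dirty(void)
{
    struct toy_page p = { .pg_dirty = true, .pte_dirty = true,
                          .data_written = false };

    toy_clear_dirty_sync_ptes(&p);  /* pte was dirty: page is redirtied */
    toy_clear_dirty_sync_ptes(&p);  /* ptes now clean: PG_dirty is dropped */
    return !p.data_written && !p.pg_dirty && !p.pte_dirty;
}
```

Under these assumptions toy_double_clear_loses_dirty() returns true. Whether the kernel can actually reach the second call without an intervening writeout is a separate question that the model does not answer.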
* Re: 2.6.19 file content corruption on ext3 2006-12-19 7:22 ` Peter Zijlstra @ 2006-12-19 7:59 ` Nick Piggin 2006-12-19 8:14 ` Linus Torvalds 0 siblings, 1 reply; 311+ messages in thread From: Nick Piggin @ 2006-12-19 7:59 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr Peter Zijlstra wrote: > On Tue, 2006-12-19 at 15:36 +1100, Nick Piggin wrote: > > >>plain text document attachment (fs-fix.patch) >>Index: linux-2.6/fs/buffer.c >>=================================================================== >>--- linux-2.6.orig/fs/buffer.c 2006-12-19 15:15:46.000000000 +1100 >>+++ linux-2.6/fs/buffer.c 2006-12-19 15:36:01.000000000 +1100 >>@@ -2852,7 +2852,17 @@ int try_to_free_buffers(struct page *pag >> * This only applies in the rare case where try_to_free_buffers >> * succeeds but the page is not freed. >> */ >>- clear_page_dirty(page); >>+ >>+ /* >>+ * If the page has been dirtied via the user mappings, then >>+ * clean buffers does not indicate the page data is actually >>+ * clean! Only clear the page dirty bit if there are no dirty >>+ * ptes either. >>+ * >>+ * If there are dirty ptes, then the page must be uptodate, so >>+ * the above concern does not apply. 
>>+ */ >>+ clear_page_dirty_sync_ptes(page); >> } >> out: >> if (buffers_to_free) { >>Index: linux-2.6/include/linux/page-flags.h >>=================================================================== >>--- linux-2.6.orig/include/linux/page-flags.h 2006-12-19 15:17:18.000000000 +1100 >>+++ linux-2.6/include/linux/page-flags.h 2006-12-19 15:34:24.000000000 +1100 >>@@ -254,6 +254,7 @@ static inline void SetPageUptodate(struc >> struct page; /* forward declaration */ >> >> int test_clear_page_dirty(struct page *page); >>+int test_clear_page_dirty_sync_ptes(struct page *page); >> int test_clear_page_writeback(struct page *page); >> int test_set_page_writeback(struct page *page); >> >>@@ -262,6 +263,11 @@ static inline void clear_page_dirty(stru >> test_clear_page_dirty(page); >> } >> >>+static inline void clear_page_dirty_sync_ptes(struct page *page) >>+{ >>+ test_clear_page_dirty_sync_ptes(page); >>+} >>+ >> static inline void set_page_writeback(struct page *page) >> { >> test_set_page_writeback(page); >>Index: linux-2.6/mm/page-writeback.c >>=================================================================== >>--- linux-2.6.orig/mm/page-writeback.c 2006-12-19 15:17:53.000000000 +1100 >>+++ linux-2.6/mm/page-writeback.c 2006-12-19 15:33:29.000000000 +1100 >>@@ -844,9 +844,10 @@ EXPORT_SYMBOL(set_page_dirty_lock); >> >> /* >> * Clear a page's dirty flag, while caring for dirty memory accounting. >>+ * Does not clear pte dirty bits. >> * Returns true if the page was previously dirty. 
>> */ >>-int test_clear_page_dirty(struct page *page) >>+static int test_clear_page_dirty_leave_ptes(struct page *page) >> { >> struct address_space *mapping = page_mapping(page); >> unsigned long flags; >>@@ -862,10 +863,8 @@ int test_clear_page_dirty(struct page *p >> * We can continue to use `mapping' here because the >> * page is locked, which pins the address_space >> */ >>- if (mapping_cap_account_dirty(mapping)) { >>- page_mkclean(page); >>+ if (mapping_cap_account_dirty(mapping)) >> dec_zone_page_state(page, NR_FILE_DIRTY); >>- } >> return 1; >> } >> write_unlock_irqrestore(&mapping->tree_lock, flags); >>@@ -873,9 +872,43 @@ int test_clear_page_dirty(struct page *p >> } >> return TestClearPageDirty(page); >> } >>+ >>+/* >>+ * As above, but does clear dirty bits from ptes >>+ */ >>+int test_clear_page_dirty(struct page *page) >>+{ >>+ struct address_space *mapping = page_mapping(page); >>+ >>+ if (test_clear_page_dirty_leave_ptes(page)) { >>+ if (mapping_cap_account_dirty(mapping)) >>+ page_mkclean(page); >>+ return 1; >>+ } >>+ return 0; >>+} >> EXPORT_SYMBOL(test_clear_page_dirty); >> >> /* >>+ * As above, but redirties page if any dirty ptes are found (and then only >>+ * if the mapping accounts dirty pages, otherwise dirty ptes are left dirty >>+ * but the page is cleaned). >>+ */ >>+int test_clear_page_dirty_sync_ptes(struct page *page) >>+{ >>+ struct address_space *mapping = page_mapping(page); >>+ >>+ if (test_clear_page_dirty_leave_ptes(page)) { >>+ if (mapping_cap_account_dirty(mapping)) { >>+ if (page_mkclean(page)) >>+ set_page_dirty(page); >>+ } >>+ return 1; >>+ } >>+ return 0; >>+} >>+ >>+/* >> * Clear a page's dirty flag, while caring for dirty memory accounting. >> * Returns true if the page was previously dirty. >> * > > > Hmm, not quite; It certainly look better than the extra ,[01] tagged to > test_clear_page_dirty() though. 
Although I would have expected it the > other way around - test_clear_page_dirty_sync_ptes to be the default > case and test_clear_page_dirty_clean_ptes to be used in > clear_page_dirty_for_io(). > > Anyway it has the same issues as the others. See what happens when you > run two test_clear_page_dirty_sync_ptes() consecutively, you still lose > PG_dirty even though the page might actually be dirty. How can this happen? We'll only test_clear_page_dirty_sync_ptes again after buffers have been reattached, and subsequently cleaned. And in that case if the ptes are still clean at this point then the page really is clean. -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 7:59 ` Nick Piggin @ 2006-12-19 8:14 ` Linus Torvalds 2006-12-19 9:40 ` Nick Piggin 0 siblings, 1 reply; 311+ messages in thread From: Linus Torvalds @ 2006-12-19 8:14 UTC (permalink / raw) To: Nick Piggin Cc: Peter Zijlstra, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 19 Dec 2006, Nick Piggin wrote: > > > > Anyway it has the same issues as the others. See what happens when you > > run two test_clear_page_dirty_sync_ptes() consecutively, you still lose > > PG_dirty even though the page might actually be dirty. > > How can this happen? We'll only test_clear_page_dirty_sync_ptes again > after buffers have been reattached, and subsequently cleaned. And in > that case if the ptes are still clean at this point then the page really > is clean. Why do you talk about buffers being reattached? Are you still in some world where "try_to_free_buffers()" matters? Have you not followed the discussion? Why do you ignore my MUCH SIMPLER patch that just removed all this crap ENTIRELY from "try_to_free_buffers()", and the exact same corruption happened? Forget about "try_to_free_buffers()". Please apply this patch to your tree first. That gets rid of _one_ copy of totally insane code that did all the wrong things. Only after you have applied this patch should you look at the code again. Realizing that the corruption still happens. So forget about buffers already. That piece of code was crap. Linus --- diff --git a/fs/buffer.c b/fs/buffer.c index d1f1b54..263f88e 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page) int ret = 0; BUG_ON(!PageLocked(page)); - if (PageWriteback(page)) + if (PageDirty(page) || PageWriteback(page)) return 0; if (mapping == NULL) { /* can this still happen? 
*/ @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page) spin_lock(&mapping->private_lock); ret = drop_buffers(page, &buffers_to_free); spin_unlock(&mapping->private_lock); - if (ret) { - /* - * If the filesystem writes its buffers by hand (eg ext3) - * then we can have clean buffers against a dirty page. We - * clean the page here; otherwise later reattachment of buffers - * could encounter a non-uptodate page, which is unresolvable. - * This only applies in the rare case where try_to_free_buffers - * succeeds but the page is not freed. - * - * Also, during truncate, discard_buffer will have marked all - * the page's buffers clean. We discover that here and clean - * the page also. - */ - if (test_clear_page_dirty(page)) - task_io_account_cancelled_write(PAGE_CACHE_SIZE); - } out: if (buffers_to_free) { struct buffer_head *bh = buffers_to_free; ^ permalink raw reply related [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 8:14 ` Linus Torvalds @ 2006-12-19 9:40 ` Nick Piggin 2006-12-19 16:46 ` Linus Torvalds 0 siblings, 1 reply; 311+ messages in thread From: Nick Piggin @ 2006-12-19 9:40 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr Linus Torvalds wrote: > > On Tue, 19 Dec 2006, Nick Piggin wrote: > >>>Anyway it has the same issues as the others. See what happens when you >>>run two test_clear_page_dirty_sync_ptes() consecutively, you still lose >>>PG_dirty even though the page might actually be dirty. >> >>How can this happen? We'll only test_clear_page_dirty_sync_ptes again >>after buffers have been reattached, and subsequently cleaned. And in >>that case if the ptes are still clean at this point then the page really >>is clean. > > > Why do you talk about buffers being reattached? Are you still in some > world where "try_to_free_buffers()" matters? Have you not followed the I'm talking about fixing just the race Andrew noticed via inspection. No, it doesn't appear to fix Andrei's problem, unfortunately. But it needs to be fixed all the same, doesn't it? > discussion? Why do you ignore my MUCH SIMPLER patch that just removed all > this crap ENTIRELY from "try_to_free_buffers()", and the exact same > corruption happened? > > Forget about "try_to_free_buffers()". Please apply this patch to your tree > first. That gets rid of _one_ copy of totally insane code that did all the > wrong things. > > Only after you have applied this patch should you look at the code again. > Realizing that the corruption still happens. > > So forget about buffers already. That piece of code was crap. Now I'm not exactly sure how ext3 (or any other) filesystems make use of this particular feature of try_to_free_buffers(), but it is clear from the comments what it is for. So your patch isn't really a minimal fix (ie. 
it would require an OK from all filesystems, wouldn't it?) Or did I miss a mail where you reasoned that it is safe to make this change (/me goes to reread the thread)... > > Linus > > --- > diff --git a/fs/buffer.c b/fs/buffer.c > index d1f1b54..263f88e 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page) > int ret = 0; > > BUG_ON(!PageLocked(page)); > - if (PageWriteback(page)) > + if (PageDirty(page) || PageWriteback(page)) > return 0; > > if (mapping == NULL) { /* can this still happen? */ > @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page) > spin_lock(&mapping->private_lock); > ret = drop_buffers(page, &buffers_to_free); > spin_unlock(&mapping->private_lock); > - if (ret) { > - /* > - * If the filesystem writes its buffers by hand (eg ext3) > - * then we can have clean buffers against a dirty page. We > - * clean the page here; otherwise later reattachment of buffers > - * could encounter a non-uptodate page, which is unresolvable. > - * This only applies in the rare case where try_to_free_buffers > - * succeeds but the page is not freed. > - * > - * Also, during truncate, discard_buffer will have marked all > - * the page's buffers clean. We discover that here and clean > - * the page also. > - */ > - if (test_clear_page_dirty(page)) > - task_io_account_cancelled_write(PAGE_CACHE_SIZE); > - } > out: > if (buffers_to_free) { > struct buffer_head *bh = buffers_to_free; > -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com ^ permalink raw reply [flat|nested] 311+ messages in thread
* Re: 2.6.19 file content corruption on ext3 2006-12-19 9:40 ` Nick Piggin @ 2006-12-19 16:46 ` Linus Torvalds 0 siblings, 0 replies; 311+ messages in thread From: Linus Torvalds @ 2006-12-19 16:46 UTC (permalink / raw) To: Nick Piggin Cc: Peter Zijlstra, Andrew Morton, andrei.popa, Linux Kernel Mailing List, Hugh Dickins, Florian Weimer, Marc Haber, Martin Michlmayr On Tue, 19 Dec 2006, Nick Piggin wrote: > > Now I'm not exactly sure how ext3 (or any other) filesystems make use > of this particular feature of try_to_free_buffers(), but it is clear > from the comments what it is for. So your patch isn't really a minimal > fix (ie. it would require an OK from all filesystems, wouldn't it?) > > Or did I miss a mail where you reasoned that it is safe to make this > change (/me goes to reread the thread)... I'm saying it had _better_ be safe, and no, low-level filesystems don't actually matter. The page has to be cleanable _some_ way. So if we test for "PageDirty()" at the top, and just refuse to do it in try_to_free_buffers(), we still know that the _proper_ page cleaning had better clean it. Because ttfp() is never going to clean the page in the general case _anyway_. So I'm really saying: - the page WILL be cleaned by the real page cleaning action (ie memory pressure or sync or something else causing us to go through the bog-standard page-based writeout). Does anybody dispute this? - the "ttfp()" hack was a HACK. It was an ugly and nasty hack even when it was first introduced. It gets doubly worse now that we know we have something wrong with page cleaning, and it has distracted from the real problem. - I removed that ugly and disgusting hack entirely at first, but Andrew points out that he really wants to keep the buffers there, because the buffers being clean actually say something. 
That, together with the fact that as long as the page is dirty, the buffers really do end up having a job to do, made me add a much smaller hack to replace the big ugly one ("don't even try, if the page is marked dirty"). - so with that thing in place, there isn't even any change in behaviour wrt the buffers and low-level filesystems. It's just that we make them a bit harder to get rid of. But arguably that shouldn't actually ever really _happen_ anyway (because I think it's a BUG if the page is marked dirty but none of the buffers are), so I think that part is a non-issue. In other words, ttfp() _never_ had anything to do with "page cleaning". Not originally, not with the horrible hack, and not with my patch. Trying to mix it in just caused a bug that _everybody_ agrees is a bug. It's not the bug we're chasing, but we've got three different patches to fix it (Andrew's, mine and yours), and mine is the simplest one by far especially in the long run, because it just REMOVES the ugly dependency. And yes, I probably care more about "in the long run" than most. To me, a bug is a bug even if it's _just_ a maintenance headache. Andrew's patch made things _worse_ ("magic insane flag"), and while yours didn't make the code worse, it still introduced a totally insane "clean the page but if the PTEs are dirty, do something else" notion. IF THE PAGE TRULY IS CLEAN (and both you and Andrew claim it is, if all buffers are clean - since you mark it clean in the non-mapped case) THEN YOU SHOULD BE ABLE TO CLEAN THE PAGE TABLE BITS TOO. And by claiming that the page table bits are different from PG_dirty, you're just making the issues worse. They shouldn't be. That's what the whole point of Peter's patch was: PG_dirty fundamentally _means_ that the page tables might be dirty too. That was the whole _point_ of doing all this in 2.6.19 in the first place. 
So if you cannot accept that page table bits should be on "equal footing" with PG_dirty, then you should just say "Let's remove Peter's patch entirely". Linus ^ permalink raw reply [flat|nested] 311+ messages in thread
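The smaller guard argued for above can be modelled in a few lines of user-space C. This is a sketch, not the kernel function: struct toy_bh_page and toy_try_to_free_buffers() are hypothetical names, and dropping the buffers is assumed to always succeed, which the real drop_buffers() does not guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page with attached buffers; NOT the kernel structure. */
struct toy_bh_page {
    bool dirty;       /* models PG_dirty */
    bool writeback;   /* models PG_writeback */
    int  nr_buffers;  /* number of attached buffer heads */
};

/*
 * Models the patched check: refuse to touch the buffers while the page
 * is dirty or under writeback. Returns 1 if the buffers were dropped,
 * 0 if they were left in place.
 */
int toy_try_to_free_buffers(struct toy_bh_page *p)
{
    assert(p != NULL);
    if (p->dirty || p->writeback)
        return 0;
    p->nr_buffers = 0;  /* assumption: dropping the buffers always succeeds */
    return 1;
}
```

In this model the function never writes p->dirty on either path, so freeing buffers and cleaning the page stay separate operations; a dirty page simply keeps its buffers until the ordinary writeout path has cleaned it.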