linux-kernel.vger.kernel.org archive mirror
* Re: [PATCH] mm: fix page_mkclean_one
@ 2007-02-01 22:40 Mark Groves
  2007-02-02  8:42 ` Nick Piggin
  0 siblings, 1 reply; 45+ messages in thread
From: Mark Groves @ 2007-02-01 22:40 UTC (permalink / raw)
  To: linux-kernel

Hi,


I have been seeing a problem when using sendfile repeatedly on an
SMP server, which I believe is related to the problem that was
discovered recently with marking dirty pages. The bug, as well as a test
script, is listed at http://bugzilla.kernel.org/show_bug.cgi?id=7650.
Currently, we're experiencing errors where part of a previous packet is
being sent out rather than the current packet.

I have applied the patch Linus posted to a 2.6.19 kernel but am still
seeing the problem. Are there any other places in the kernel that mark
pages as dirty and might need a similar fix?
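
For reference, here is a minimal sketch of the kind of sendfile() loop
we use. It is only an illustration, not the test script from the bug
report; send_whole_file() is a made-up helper and sock_fd stands for an
already-connected socket:

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sendfile.h>
#include <fcntl.h>
#include <unistd.h>

/* Send the whole file at "path" over sock_fd with sendfile(). */
static int send_whole_file(int sock_fd, const char *path)
{
        struct stat st;
        off_t off = 0;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return -1;
        if (fstat(fd, &st) < 0) {
                close(fd);
                return -1;
        }
        while (off < st.st_size) {
                ssize_t n = sendfile(sock_fd, fd, &off, st.st_size - off);

                if (n <= 0)
                        break;
        }
        close(fd);
        return off == st.st_size ? 0 : -1;
}

The problem shows up after a loop like this has been run many times in
a row, with data from an earlier send appearing in a later one.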


Regards,

Mark Groves
Researcher
University of Waterloo
Waterloo, Ontario, Canada


* Re: [PATCH] mm: fix page_mkclean_one
@ 2006-12-28 15:50 martin
  0 siblings, 0 replies; 45+ messages in thread
From: martin @ 2006-12-28 15:50 UTC (permalink / raw)
  To: Linus Torvalds, Guillaume Chazarain
  Cc: David Miller, ranma, gordonfarquharson, tbm, Peter Zijlstra,
	andrei.popa, Andrew Morton, hugh, nickpiggin, arjan,
	Linux Kernel Mailing List


On Thu Dec 28 15:09, Guillaume Chazarain wrote:

>I set up a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/
>I have corruption with every Fedora release kernel except the first, that is,
>2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit
>some corruption.

I can confirm that I see corruption with Fedora kernel 2.6.18-1.2239.fc5:

...
Chunk 142969 corrupted (0-1459)  (2580-4039)
Expected 121, got 0
Written as (89632)127652(124721)
Chunk 142976 corrupted (0-1459)  (512-1971)
Expected 128, got 0
Written as (121734)128324(108589)
Checking chunk 143639/143640 (99%)
$ uname -a
Linux 2.6.18-1.2239.fc5 #1 Fri Nov 10 13:04:06 EST 2006 i686 athlon i386 GNU/Linux
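
For context, the messages above come from a write-then-verify pass over
the test file. A rough sketch of that kind of checker is below; it only
illustrates the checking logic and is not Guillaume's actual test
program (chunk size, count and fill pattern are made up):

#include <stdio.h>
#include <string.h>

enum { CHUNK_SIZE = 4096, NCHUNKS = 1024 };     /* illustrative sizes */

/* The byte every position of chunk i is expected to hold. */
static unsigned char pattern(long i)
{
        return (unsigned char)(i % 251);
}

/* Fill the file with patterned chunks. */
static int write_chunks(const char *path)
{
        unsigned char buf[CHUNK_SIZE];
        FILE *f = fopen(path, "wb");
        long i;

        if (!f)
                return -1;
        for (i = 0; i < NCHUNKS; i++) {
                memset(buf, pattern(i), sizeof(buf));
                if (fwrite(buf, 1, sizeof(buf), f) != sizeof(buf)) {
                        fclose(f);
                        return -1;
                }
        }
        return fclose(f);
}

/* Read the file back and report any chunk whose bytes differ. */
static int verify_chunks(const char *path)
{
        unsigned char buf[CHUNK_SIZE];
        FILE *f = fopen(path, "rb");
        long i, bad = 0;

        if (!f)
                return -1;
        for (i = 0; i < NCHUNKS; i++) {
                size_t j;

                if (fread(buf, 1, sizeof(buf), f) != sizeof(buf))
                        break;
                for (j = 0; j < sizeof(buf); j++) {
                        if (buf[j] != pattern(i)) {
                                printf("Chunk %ld corrupted at offset %zu: expected %u, got %u\n",
                                       i, j, (unsigned)pattern(i), (unsigned)buf[j]);
                                bad++;
                                break;
                        }
                }
        }
        fclose(f);
        return bad ? -1 : 0;
}

int main(void)
{
        const char *path = "chunks.dat";

        if (write_chunks(path) || verify_chunks(path))
                return 1;
        puts("no corruption detected");
        return 0;
}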

/Martin

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
@ 2006-12-24  8:10 Gordon Farquharson
  2006-12-24  8:43 ` Linus Torvalds
  0 siblings, 1 reply; 45+ messages in thread
From: Gordon Farquharson @ 2006-12-24  8:10 UTC (permalink / raw)
  To: Martin Michlmayr
  Cc: Peter Zijlstra, Andrei Popa, Andrew Morton, Linus Torvalds,
	Hugh Dickins, Nick Piggin, Arjan van de Ven,
	Linux Kernel Mailing List

On 12/22/06, Martin Michlmayr <tbm@cyrius.com> wrote:

> * Peter Zijlstra <a.p.zijlstra@chello.nl> [2006-12-22 14:25]:
> > > .... and it failed.
> > Since you are on ARM you might want to try with the page_mkclean_one
> > cleanup patch too.
>
> I've already tried it and it didn't work.  I just tried it again
> together with Linus' patch and the two from Andrew and it still fails.
> (For reference, the patch is attached.)

I can confirm this behaviour with 2.6.19 and the patches mentioned
above (cumulative patch for 2.6.19 appended to the end of this email).

Is there any debugging information I could provide that might help
solve the problem? Would it help to know the nature of the corruption,
e.g. an analysis of the corruption in the file? I have previously asked
the apt developers if they wanted to look at the corrupted cache files,
but there were no takers.

BTW, I decided to try Linus's test program [1] on ARM (I don't think
that anybody had tried it on ARM before).

Since we see file corruption with 2.6.18 + [PATCH] mm: tracking shared
dirty pages [2], I ran Linus's program on machines with the following
setups:

2.6.18 + the following patches
   mm: tracking shared dirty pages [2]
   mm: balance dirty pages [3]
   mm: optimize the new mprotect() code a bit [4]
   mm: small cleanup of install_page() [5]
   mm: fixup do_wp_page() [6]
   mm: msync() cleanup [7]

$ ./mm-test | od -x
0000000 aaaa aaaa aaaa aaaa aaaa 0000 0000 0000
0000020 0000 0000 5555 5555 5555 5555 5555 5555
0000040 5555 5555 5555 5555
0000050

2.6.18 (no mm patches)

$ ./mm-test | od -x
0000000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa
0000020 aaaa aaaa 5555 5555 5555 5555 5555 5555
0000040 5555 5555 5555 5555
0000050

I don't know if this helps at all.

Gordon

[1] http://lkml.org/lkml/2006/12/19/200
[2] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89
[3] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
[4] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=c1e6098b23bb46e2b488fe9a26f831f867157483
[5] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=e88dd6c11c5aef74d8b74a062767add53315533b
[6] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ee6a6457886a80415db209e87033b63f2b06558c
[7] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=204ec841fbea3e5138168edbc3a76d46747cc987

diff -Naupr linux-2.6.19.orig/fs/buffer.c linux-2.6.19/fs/buffer.c
--- linux-2.6.19.orig/fs/buffer.c       2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/fs/buffer.c    2006-12-21 01:16:31.000000000 -0700
@@ -2832,7 +2832,7 @@ int try_to_free_buffers(struct page *pag
        int ret = 0;

        BUG_ON(!PageLocked(page));
-       if (PageWriteback(page))
+       if (PageDirty(page) || PageWriteback(page))
                return 0;

        if (mapping == NULL) {          /* can this still happen? */
@@ -2843,17 +2843,6 @@ int try_to_free_buffers(struct page *pag
        spin_lock(&mapping->private_lock);
        ret = drop_buffers(page, &buffers_to_free);
        spin_unlock(&mapping->private_lock);
-       if (ret) {
-               /*
-                * If the filesystem writes its buffers by hand (eg ext3)
-                * then we can have clean buffers against a dirty page.  We
-                * clean the page here; otherwise later reattachment of buffers
-                * could encounter a non-uptodate page, which is unresolvable.
-                * This only applies in the rare case where try_to_free_buffers
-                * succeeds but the page is not freed.
-                */
-               clear_page_dirty(page);
-       }
 out:
        if (buffers_to_free) {
                struct buffer_head *bh = buffers_to_free;
diff -Naupr linux-2.6.19.orig/fs/hugetlbfs/inode.c linux-2.6.19/fs/hugetlbfs/inode.c
--- linux-2.6.19.orig/fs/hugetlbfs/inode.c      2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/fs/hugetlbfs/inode.c   2006-12-21 01:15:21.000000000 -0700
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct

  static void truncate_huge_page(struct page *page)
 {
-       clear_page_dirty(page);
+       cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
        ClearPageUptodate(page);
        remove_from_page_cache(page);
        put_page(page);
diff -Naupr linux-2.6.19.orig/include/linux/page-flags.h linux-2.6.19/include/linux/page-flags.h
--- linux-2.6.19.orig/include/linux/page-flags.h        2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/include/linux/page-flags.h     2006-12-21 01:15:21.000000000 -0700
@@ -253,15 +253,11 @@ static inline void SetPageUptodate(struc

  struct page;   /* forward declaration */

-int test_clear_page_dirty(struct page *page);
+extern void cancel_dirty_page(struct page *page, unsigned int account_size);
+
  int test_clear_page_writeback(struct page *page);
  int test_set_page_writeback(struct page *page);

-static inline void clear_page_dirty(struct page *page)
-{
-       test_clear_page_dirty(page);
-}
-
  static inline void set_page_writeback(struct page *page)
 {
        test_set_page_writeback(page);
diff -Naupr linux-2.6.19.orig/mm/memory.c linux-2.6.19/mm/memory.c
--- linux-2.6.19.orig/mm/memory.c       2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/memory.c    2006-12-21 01:15:21.000000000 -0700
@@ -1832,6 +1832,33 @@ void unmap_mapping_range(struct address_
 }
 EXPORT_SYMBOL(unmap_mapping_range);

+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+       pgoff_t index;
+       unsigned int offset;
+       struct page *page;
+
+       if (!mapping)
+               return;
+       offset = size & ~PAGE_MASK;
+       if (!offset)
+               return;
+       index = size >> PAGE_SHIFT;
+       page = find_lock_page(mapping, index);
+       if (page) {
+               unsigned int check = 0;
+               unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+               do {
+                       check += kaddr[offset++];
+               } while (offset < PAGE_SIZE);
+               kunmap_atomic(kaddr,KM_USER0);
+               unlock_page(page);
+               page_cache_release(page);
+               if (check)
+                       printk("%s: BADNESS: truncate check %u\n", current->comm, check);
+       }
+}
+
 /**
  * vmtruncate - unmap mappings "freed" by truncate() syscall
  * @inode: inode of the file used
@@ -1865,6 +1892,7 @@ do_expand:
                goto out_sig;
        if (offset > inode->i_sb->s_maxbytes)
                goto out_big;
+       check_last_page(mapping, inode->i_size);
        i_size_write(inode, offset);

 out_truncate:
diff -Naupr linux-2.6.19.orig/mm/page-writeback.c linux-2.6.19/mm/page-writeback.c
--- linux-2.6.19.orig/mm/page-writeback.c       2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/page-writeback.c    2006-12-21 01:26:53.000000000 -0700
@@ -843,39 +843,6 @@ int set_page_dirty_lock(struct page *pag
 EXPORT_SYMBOL(set_page_dirty_lock);

 /*
- * Clear a page's dirty flag, while caring for dirty memory accounting.
- * Returns true if the page was previously dirty.
- */
-int test_clear_page_dirty(struct page *page)
-{
-       struct address_space *mapping = page_mapping(page);
-       unsigned long flags;
-
-       if (mapping) {
-               write_lock_irqsave(&mapping->tree_lock, flags);
-               if (TestClearPageDirty(page)) {
-                       radix_tree_tag_clear(&mapping->page_tree,
-                                               page_index(page),
-                                               PAGECACHE_TAG_DIRTY);
-                       write_unlock_irqrestore(&mapping->tree_lock, flags);
-                       /*
-                        * We can continue to use `mapping' here because the
-                        * page is locked, which pins the address_space
-                        */
-                       if (mapping_cap_account_dirty(mapping)) {
-                               page_mkclean(page);
-                               dec_zone_page_state(page, NR_FILE_DIRTY);
-                       }
-                       return 1;
-               }
-               write_unlock_irqrestore(&mapping->tree_lock, flags);
-               return 0;
-       }
-       return TestClearPageDirty(page);
-}
-EXPORT_SYMBOL(test_clear_page_dirty);
-
-/*
  * Clear a page's dirty flag, while caring for dirty memory accounting.
  * Returns true if the page was previously dirty.
  *
diff -Naupr linux-2.6.19.orig/mm/rmap.c linux-2.6.19/mm/rmap.c
--- linux-2.6.19.orig/mm/rmap.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/rmap.c      2006-12-22 23:25:09.000000000 -0700
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page
 {
        struct mm_struct *mm = vma->vm_mm;
        unsigned long address;
-       pte_t *pte, entry;
+       pte_t *pte;
        spinlock_t *ptl;
        int ret = 0;

@@ -444,17 +444,18 @@ static int page_mkclean_one(struct page
        if (!pte)
                goto out;

-       if (!pte_dirty(*pte) && !pte_write(*pte))
-               goto unlock;
+       if (pte_dirty(*pte) || pte_write(*pte)) {
+               pte_t entry;

-       entry = ptep_get_and_clear(mm, address, pte);
-       entry = pte_mkclean(entry);
-       entry = pte_wrprotect(entry);
-       ptep_establish(vma, address, pte, entry);
-       lazy_mmu_prot_update(entry);
-       ret = 1;
+               flush_cache_page(vma, address, pte_pfn(*pte));
+               entry = ptep_clear_flush(vma, address, pte);
+               entry = pte_wrprotect(entry);
+               entry = pte_mkclean(entry);
+               set_pte_at(vma, address, pte, entry);
+               lazy_mmu_prot_update(entry);
+               ret = 1;
+       }

-unlock:
        pte_unmap_unlock(pte, ptl);
 out:
        return ret;
@@ -489,6 +490,8 @@ int page_mkclean(struct page *page)
                if (mapping)
                        ret = page_mkclean_file(mapping, page);
        }
+       if (page_test_and_clear_dirty(page))
+               ret = 1;

        return ret;
 }
@@ -587,8 +590,6 @@ void page_remove_rmap(struct page *page)
                 * Leaving it set also helps swapoff to reinstate ptes
                 * faster for those pages still in swapcache.
                 */
-               if (page_test_and_clear_dirty(page))
-                       set_page_dirty(page);
                __dec_zone_page_state(page,
                                PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
        }
@@ -607,6 +608,7 @@ static int try_to_unmap_one(struct page
        pte_t pteval;
        spinlock_t *ptl;
        int ret = SWAP_AGAIN;
+       struct page *dirty_page = NULL;

        address = vma_address(page, vma);
        if (address == -EFAULT)
@@ -633,7 +635,7 @@ static int try_to_unmap_one(struct page

        /* Move the dirty bit to the physical page now the pte is gone. */
        if (pte_dirty(pteval))
-               set_page_dirty(page);
+               dirty_page = page;

        /* Update high watermark before we lower rss */
        update_hiwater_rss(mm);
@@ -684,6 +686,8 @@ static int try_to_unmap_one(struct page

 out_unmap:
        pte_unmap_unlock(pte, ptl);
+       if (dirty_page)
+               set_page_dirty(dirty_page);
 out:
        return ret;
 }
@@ -915,6 +919,9 @@ int try_to_unmap(struct page *page, int
        else
                ret = try_to_unmap_file(page, migration);

+       if (page_test_and_clear_dirty(page))
+               set_page_dirty(page);
+
        if (!page_mapped(page))
                ret = SWAP_SUCCESS;
        return ret;
diff -Naupr linux-2.6.19.orig/mm/truncate.c linux-2.6.19/mm/truncate.c
--- linux-2.6.19.orig/mm/truncate.c     2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/truncate.c  2006-12-23 13:21:42.000000000 -0700
@@ -50,6 +50,21 @@ static inline void truncate_partial_page
                do_invalidatepage(page, partial);
 }

+void cancel_dirty_page(struct page *page, unsigned int account_size)
+{
+       /* If we're cancelling the page, it had better not be mapped any more */
+       if (page_mapped(page)) {
+               static unsigned int warncount;
+
+               WARN_ON(++warncount < 5);
+       }
+
+       if (TestClearPageDirty(page) && account_size &&
+                       mapping_cap_account_dirty(page->mapping))
+               dec_zone_page_state(page, NR_FILE_DIRTY);
+}
+
+
 /*
  * If truncate cannot remove the fs-private metadata from the page, the page
  * becomes anonymous.  It will be left on the LRU and may even be mapped into
@@ -66,10 +81,11 @@ truncate_complete_page(struct address_sp
        if (page->mapping != mapping)
                return;

+       cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
        if (PagePrivate(page))
                do_invalidatepage(page, 0);

-       clear_page_dirty(page);
        ClearPageUptodate(page);
        ClearPageMappedToDisk(page);
        remove_from_page_cache(page);
@@ -348,7 +364,6 @@ int invalidate_inode_pages2_range(struct
                for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
                        struct page *page = pvec.pages[i];
                        pgoff_t page_index;
-                       int was_dirty;

                        lock_page(page);
                        if (page->mapping != mapping) {
@@ -384,12 +399,8 @@ int invalidate_inode_pages2_range(struct
                                          PAGE_CACHE_SIZE, 0);
                                }
                        }
-                       was_dirty = test_clear_page_dirty(page);
-                       if (!invalidate_complete_page2(mapping, page)) {
-                               if (was_dirty)
-                                       set_page_dirty(page);
+                       if (!invalidate_complete_page2(mapping, page))
                                ret = -EIO;
-                       }
                        unlock_page(page);
                }
                pagevec_release(&pvec);

-- 
Gordon Farquharson

* Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
@ 2006-12-21 20:01 Linus Torvalds
  2006-12-28  0:00 ` Martin Schwidefsky
  0 siblings, 1 reply; 45+ messages in thread
From: Linus Torvalds @ 2006-12-21 20:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: schwidefsky, Martin Michlmayr, Hugh Dickins, Nick Piggin,
	Arjan van de Ven, Andrei Popa, Andrew Morton,
	Linux Kernel Mailing List, Florian Weimer, Marc Haber,
	Heiko Carstens, Arnd Bergmann, gordonfarquharson



On Thu, 21 Dec 2006, Peter Zijlstra wrote:
> 
> Also, I'm dubious about the while thing and stuck a WARN_ON(ret) thing
> at the beginning of the loop. flush_tlb_page() does IPI the other cpus
> to flush their tlb too, so there should not be a SMP race, Arjan?

Now, the reason I think the loop may be needed is:

	CPU#0				CPU#1
	-----				-----
					load old PTE entry
	clear dirty and WP bits
					write to page using old PTE
					NOT CHECKING that the new one
					is write-protected, and just
					setting the dirty bit blindly
					(but atomically)
	flush_tlb_page()
					TLB flushed, but we now have a
					page that is marked dirty and
					unwritable in the page tables,
					and we will mark it clean in
					"struct page *"

Now, the scary thing is, IF a CPU does this, then the way we do all this, 
we may actually have the following sequence:

	CPU#0				CPU#1
	-----				-----
					load old PTE entry
	ptep_clear_flush():
					atomic "set dirty bit" sequence
					PTEP now contains 0000040 !!!
	flush_tlb_page();
					TLB flushed, but PTEP is still 
					"dirty zero"
	write the clear/readonly PTE
					THE DIRTY BIT WAS LOST!

which might actually explain this bug.

I personally _thought_ that Intel CPUs don't actually do a "set dirty
bit atomically" sequence, but more of a "set dirty bit but trap if the TLB
is nonpresent" thing, but I have absolutely no proof for that.

Anyway, IF this is the case, then the following patch may or may not fix
things. It avoids the problem by never overwriting a PTE entry, not even
the "cleared" one. It always does an atomic "xchg()" with a valid new
entry, and looks at the old bits.

What do you guys think? Does something like this work out for S/390 too? I 
tried to make that "ptep_flush_dirty()" concept work for architectures 
that hide the dirty bit somewhere else too, but..

It actually simplifies the architecture-specific code (you just need to 
implement a trivial "ptep_exchange()" and "ptep_flush_dirty()" macro), but 
I only did x86-64 and i386, and while I've booted with this, I haven't 
really given the thing a lot of really _deep_ thought.

But I think this might be safer, as per above.. And it _might_ actually 
explain the problem. Exactly because the "ptep_clear() + blindly assign to 
ptep" might lose a dirty bit that was written by another CPU.

But this really does depend on what a CPU does when it marks a page dirty. 
Does it just blindly write the dirty bit? Or does it actually _validate_ 
that the old page table entry was still present and writable?

This patch makes no assumptions. It should work even if a CPU just writes 
the dirty bit blindly, and the only expectation is that the page tables 
can be accessed atomically (which had _better_ be true on any SMP 
architecture)
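
To make the difference between the two sequences concrete, here is a
small user-space toy using plain C11 atomics. Nothing in it is kernel
code: "fake_pte", the DIRTY/WRITABLE bit values and hw_set_dirty() are
all made up, with hw_set_dirty() standing in for another CPU atomically
setting the dirty bit behind our back.

#include <stdatomic.h>
#include <stdio.h>

#define WRITABLE        0x002UL
#define DIRTY           0x040UL

static _Atomic unsigned long fake_pte = WRITABLE;

/* What the "other CPU" does: atomically set the dirty bit. */
static void hw_set_dirty(void)
{
        atomic_fetch_or(&fake_pte, DIRTY);
}

/*
 * Racy sequence: load the entry, compute a clean/read-only value, and
 * blindly store it back.  A hw_set_dirty() that lands between the load
 * and the store is silently overwritten -- the dirty bit is lost.
 */
static int clean_blind(void)
{
        unsigned long old = atomic_load(&fake_pte);

        /* a concurrent hw_set_dirty() here is lost by the store below */
        atomic_store(&fake_pte, old & ~(DIRTY | WRITABLE));
        return (old & DIRTY) != 0;
}

/*
 * Exchange sequence: swap the clean value in and look at what was
 * really there, retrying until the value read back is the clean one.
 * A dirty bit set at any point shows up in "old" and is accumulated,
 * never lost.
 */
static int clean_xchg(void)
{
        unsigned long old, new = 0;     /* clean and read-only */
        int dirty = 0;

        for (;;) {
                old = atomic_exchange(&fake_pte, new);
                if (old == new)
                        break;
                dirty |= (old & DIRTY) != 0;
                /* the real patch flushes the TLB here before retrying */
        }
        return dirty;
}

int main(void)
{
        hw_set_dirty();
        printf("xchg-based clean saw dirty: %d\n", clean_xchg());

        atomic_store(&fake_pte, WRITABLE);
        hw_set_dirty();
        printf("blind clean saw dirty: %d\n", clean_blind());
        return 0;
}

The patch below does the same thing at the PTE level, through the
ptep_exchange() and ptep_flush_dirty() macros.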

Arjan, can you please check within Intel, and ask what the "proper" 
sequence for doing something like this is?

			Linus

----
commit 301d2d53ca0e5d2f61b1c1c259da410c7ee6d6a7
Author: Linus Torvalds <torvalds@woody.osdl.org>
Date:   Thu Dec 21 11:11:05 2006 -0800

    Rewrite the page table "clear dirty and writable" accesses
    
    This is much simpler for most architectures, and allows us to do the
    dirty and writable clear in a single operation without any races or any
    double flushes.
    
    It's also much more careful: we never overwrite the old dirty bits at
    any time, and always make sure to do atomic memory ops to exchange and
    see the old value.
    
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 9d774d0..8879f1d 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -61,31 +61,6 @@ do {				  					  \
 })
 #endif
 
-#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
-#define ptep_test_and_clear_dirty(__vma, __address, __ptep)		\
-({									\
-	pte_t __pte = *__ptep;						\
-	int r = 1;							\
-	if (!pte_dirty(__pte))						\
-		r = 0;							\
-	else								\
-		set_pte_at((__vma)->vm_mm, (__address), (__ptep),	\
-			   pte_mkclean(__pte));				\
-	r;								\
-})
-#endif
-
-#ifndef __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
-#define ptep_clear_flush_dirty(__vma, __address, __ptep)		\
-({									\
-	int __dirty;							\
-	__dirty = ptep_test_and_clear_dirty(__vma, __address, __ptep);	\
-	if (__dirty)							\
-		flush_tlb_page(__vma, __address);			\
-	__dirty;							\
-})
-#endif
-
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 #define ptep_get_and_clear(__mm, __address, __ptep)			\
 ({									\
diff --git a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h
index e6a4723..b61d6f9 100644
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -300,18 +300,20 @@ do {									\
 	flush_tlb_page(vma, address);					\
 } while (0)
 
-#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
-#define ptep_clear_flush_dirty(vma, address, ptep)			\
-({									\
-	int __dirty;							\
-	__dirty = pte_dirty(*(ptep));					\
-	if (__dirty) {							\
-		clear_bit(_PAGE_BIT_DIRTY, &(ptep)->pte_low);		\
-		pte_update_defer((vma)->vm_mm, (address), (ptep));	\
-		flush_tlb_page(vma, address);				\
-	}								\
-	__dirty;							\
-})
+/*
+ * "ptep_exchange()" can be used to atomically change a set of
+ * page table protection bits, returning the old ones (the dirty
+ * and accessed bits in particular, since they are set by hw).
+ *
+ * "ptep_flush_dirty()" then returns the dirty status of the
+ * page (on x86-64, we just look at the dirty bit in the returned
+ * pte, but some other architectures have the dirty bits in
+ * other places than the page tables).
+ */
+#define ptep_exchange(vma, address, ptep, old, new) \
+	(old).pte_low = xchg(&(ptep)->pte_low, (new).pte_low);
+#define ptep_flush_dirty(vma, address, ptep, old) \
+	pte_dirty(old)
 
 #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
 #define ptep_clear_flush_young(vma, address, ptep)			\
diff --git a/include/asm-x86_64/pgtable.h b/include/asm-x86_64/pgtable.h
index 59901c6..07754b5 100644
--- a/include/asm-x86_64/pgtable.h
+++ b/include/asm-x86_64/pgtable.h
@@ -283,12 +283,20 @@ static inline pte_t pte_clrhuge(pte_t pte)	{ set_pte(&pte, __pte(pte_val(pte) &
 
 struct vm_area_struct;
 
-static inline int ptep_test_and_clear_dirty(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
-{
-	if (!pte_dirty(*ptep))
-		return 0;
-	return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte);
-}
+/*
+ * "ptep_exchange()" can be used to atomically change a set of
+ * page table protection bits, returning the old ones (the dirty
+ * and accessed bits in particular, since they are set by hw).
+ *
+ * "ptep_flush_dirty()" then returns the dirty status of the
+ * page (on x86-64, we just look at the dirty bit in the returned
+ * pte, but some other architectures have the dirty bits in
+ * other places than the page tables).
+ */
+#define ptep_exchange(vma, address, ptep, old, new) \
+	(old).pte = xchg(&(ptep)->pte, (new).pte);
+#define ptep_flush_dirty(vma, address, ptep, old) \
+	pte_dirty(old)
 
 static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
 {
diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..a028803 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
-	pte_t *pte, entry;
+	pte_t *ptep;
 	spinlock_t *ptl;
 	int ret = 0;
 
@@ -440,22 +440,24 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
 	if (address == -EFAULT)
 		goto out;
 
-	pte = page_check_address(page, mm, address, &ptl);
-	if (!pte)
-		goto out;
-
-	if (!pte_dirty(*pte) && !pte_write(*pte))
-		goto unlock;
-
-	entry = ptep_get_and_clear(mm, address, pte);
-	entry = pte_mkclean(entry);
-	entry = pte_wrprotect(entry);
-	ptep_establish(vma, address, pte, entry);
-	lazy_mmu_prot_update(entry);
-	ret = 1;
-
-unlock:
-	pte_unmap_unlock(pte, ptl);
+	ptep = page_check_address(page, mm, address, &ptl);
+	if (ptep) {
+		pte_t old, new;
+
+		old = *ptep;
+		new = pte_wrprotect(pte_mkclean(old));
+		if (!pte_same(old, new)) {
+			for (;;) {
+				flush_cache_page(vma, address, page_to_pfn(page));
+				ptep_exchange(vma, address, ptep, old, new);
+				if (pte_same(old, new))
+					break;
+				ret |= ptep_flush_dirty(vma, address, ptep, old);
+				flush_tlb_page(vma, address);
+			}
+		}
+		pte_unmap_unlock(ptep, ptl);
+	}
 out:
 	return ret;
 }


Thread overview: 45+ messages
2007-02-01 22:40 [PATCH] mm: fix page_mkclean_one Mark Groves
2007-02-02  8:42 ` Nick Piggin
2007-02-02 13:08   ` Evgeniy Polyakov
  -- strict thread matches above, loose matches on Subject: below --
2006-12-28 15:50 martin
2006-12-24  8:10 [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Gordon Farquharson
2006-12-24  8:43 ` Linus Torvalds
2006-12-26 16:17   ` Tobias Diedrich
2006-12-27  4:55     ` [PATCH] mm: fix page_mkclean_one David Miller
2006-12-27  7:00       ` Linus Torvalds
2006-12-27  8:39         ` Andrei Popa
2006-12-28  0:16       ` Linus Torvalds
2006-12-28  0:39         ` Linus Torvalds
2006-12-28  0:52           ` David Miller
2006-12-28  3:04             ` Linus Torvalds
2006-12-28  4:32               ` Gordon Farquharson
2006-12-28  4:53                 ` Linus Torvalds
2006-12-28  5:20                   ` Gordon Farquharson
2006-12-28  5:41                     ` David Miller
2006-12-28  5:47                       ` Gordon Farquharson
2006-12-28 10:13                     ` Russell King
2006-12-28 14:15                       ` Gordon Farquharson
2006-12-28 15:53                         ` Martin Michlmayr
2006-12-28 17:27                       ` Linus Torvalds
2006-12-28 18:44                         ` Russell King
2006-12-28 19:01                           ` Linus Torvalds
     [not found]                   ` <97a0a9ac0612272115g4cce1f08n3c3c8498a6076bd5@mail.gmail.com>
     [not found]                     ` <Pine.LNX.4.64.0612272120180.4473@woody.osdl.org>
2006-12-28  5:38                       ` Gordon Farquharson
2006-12-28  9:30                         ` Martin Michlmayr
2006-12-28 10:16                         ` Martin Michlmayr
2006-12-28 10:49                           ` Russell King
2006-12-28 14:56                             ` Martin Michlmayr
2006-12-28  5:58                       ` Gordon Farquharson
2006-12-28 17:08                         ` Linus Torvalds
2006-12-28  5:55               ` Chen, Kenneth W
2006-12-28  6:10                 ` Chen, Kenneth W
2006-12-28  6:27                   ` David Miller
2006-12-28 17:10                   ` Linus Torvalds
2006-12-28  9:15               ` Zhang, Yanmin
2006-12-28 17:15                 ` Linus Torvalds
2006-12-28 11:50               ` Petri Kaukasoina
2006-12-28 15:09               ` Guillaume Chazarain
2006-12-28 19:19                 ` Guillaume Chazarain
2006-12-28 19:28                   ` Linus Torvalds
2006-12-28 19:45                     ` Andrew Morton
2006-12-28 20:14                       ` Linus Torvalds
2006-12-28 22:38                         ` David Miller
2006-12-29  2:50                           ` Segher Boessenkool
2006-12-29  6:48                             ` Linus Torvalds
2006-12-28 22:35                       ` Mike Galbraith
2006-12-21 20:01 [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3) Linus Torvalds
2006-12-28  0:00 ` Martin Schwidefsky
2006-12-28  0:42   ` Linus Torvalds
2006-12-28  0:52     ` [PATCH] mm: fix page_mkclean_one David Miller
