* NFS BUG_ON in nfs_do_writepage
@ 2009-04-13  5:46 rercola
  2009-04-13  6:50 ` Andrew Morton
  0 siblings, 1 reply; 16+ messages in thread
From: rercola @ 2009-04-13  5:46 UTC (permalink / raw)
  To: linux-kernel

Hi world,
I've got a production server that's running as an NFSv4 client, along with 
a number of other machines.

All the other machines are perfectly happy, but this one is a bit of a 
bother. It's got a Core 2 Duo 6700, with a D975XBX2 motherboard and 4 GB 
of ECC RAM.

The problem is that, under heavy load, NFS will trip a BUG_ON in 
nfs_do_writepage, as follows:
------------[ cut here ]------------
kernel BUG at fs/nfs/write.c:252!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/virtual/block/dm-0/range
CPU 0
Modules linked in: fuse autofs4 coretemp hwmon nfs lockd nfs_acl 
auth_rpcgss sunrpc ipv6 cpufreq_ondemand acpi_cpufreq freq_table kvm_intel 
kvm snd_hda_codec_idt snd_hda_intel snd_hda_codec snd_hwdep snd_seq_dummy 
snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss 
snd_mixer_oss snd_pcm snd_timer usb_storage snd cpia_usb e1000e soundcore 
cpia ppdev firewire_ohci snd_page_alloc firewire_core i2c_i801 videodev 
parport_pc pcspkr iTCO_wdt i2c_core v4l1_compat crc_itu_t parport 
iTCO_vendor_support v4l2_compat_ioctl32 i82975x_edac edac_core raid1
Pid: 309, comm: pdflush Not tainted 2.6.29.1 #1
RIP: 0010:[<ffffffffa0291a47>]  [<ffffffffa0291a47>] 
nfs_do_writepage+0x106/0x1a2 [nfs]
RSP: 0018:ffff88012d805af0  EFLAGS: 00010282
RAX: 0000000000000001 RBX: ffffe20001f66878 RCX: 0000000000000015
RDX: 0000000000600020 RSI: 0000000000000000 RDI: ffff88000155789c
RBP: ffff88012d805b20 R08: ffff88012cd53460 R09: 0000000000000004
R10: ffff88009d421700 R11: ffffffffa02a98d0 R12: ffff88010253a300
R13: ffff88000155789c R14: ffffe20001f66878 R15: ffff88012d805c80
FS:  0000000000000000(0000) GS:ffffffff817df000(0000) 
knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000f7d2b000 CR3: 000000008708a000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process pdflush (pid: 309, threadinfo ffff88012d804000, task 
ffff88012e4fdb80)
Stack:
  ffff88012d805b20 ffffe20001f66878 ffffe20001f66878 0000000000000000
  0000000000000001 0000000000000000 ffff88012d805b40 ffffffffa0291f5a
  ffffe20001f66878 ffff88012d805e40 ffff88012d805c70 ffffffff810a9c1d
Call Trace:
  [<ffffffffa0291f5a>] nfs_writepages_callback+0x14/0x25 [nfs]
  [<ffffffff810a9c1d>] write_cache_pages+0x261/0x3a4
  [<ffffffffa0291f46>] ? nfs_writepages_callback+0x0/0x25 [nfs]
  [<ffffffffa0291f1c>] nfs_writepages+0xb5/0xdf [nfs]
  [<ffffffffa02932bd>] ? nfs_flush_one+0x0/0xeb [nfs]
  [<ffffffff81060f78>] ? bit_waitqueue+0x17/0xa4
  [<ffffffff810a9db7>] do_writepages+0x2d/0x3d
  [<ffffffff810f4a51>] __writeback_single_inode+0x1b2/0x347
  [<ffffffff8100f7d4>] ? __switch_to+0xbe/0x3eb
  [<ffffffff810f4ffb>] generic_sync_sb_inodes+0x24a/0x395
  [<ffffffff810f5354>] writeback_inodes+0xa9/0x102
  [<ffffffff810a9f26>] wb_kupdate+0xa8/0x11e
  [<ffffffff810aac9d>] pdflush+0x173/0x236
  [<ffffffff810a9e7e>] ? wb_kupdate+0x0/0x11e
  [<ffffffff810aab2a>] ? pdflush+0x0/0x236
  [<ffffffff810aab2a>] ? pdflush+0x0/0x236
  [<ffffffff81060c9e>] kthread+0x4e/0x7b
  [<ffffffff810126ca>] child_rip+0xa/0x20
  [<ffffffff81011fe7>] ? restore_args+0x0/0x30
  [<ffffffff81060c50>] ? kthread+0x0/0x7b
  [<ffffffff810126c0>] ? child_rip+0x0/0x20
Code: 89 e7 e8 d5 cc ff ff 4c 89 e7 89 c3 e8 2a cd ff ff 85 db 74 a0 e9 83 
00 00 00 41 f6 44 24 40 02 74 0d 4c 89 ef e8 e2 a5 d9 e0 90 <0f> 0b eb fe 
4c 89 f7 e8 f5 7a e1 e0 85 c0 75 49 49 8b 46 18 ba
RIP  [<ffffffffa0291a47>] nfs_do_writepage+0x106/0x1a2 [nfs]
  RSP <ffff88012d805af0>
---[ end trace 6d60c9b253ebcf15 ]---

64bit kernel, 32bit userland. 2.6.29.1 vanilla, bug occurred as early as 
2.6.28, bug still occurs with 2.6.30-rc1. I'm running bisect now, but 
there's a limit on how often I can reboot a production server, so I'll 
report back when I find it.

The unfortunate part, of course, is that when this bug occurs, the 
writepage never returns...meaning that the process in question is 
permanently locked in la-la-land (AKA state D). This renders this 
unfortunate bug a bit...inconvenient.

[No other clients, or the server, report anything interesting when this 
happens, AFAICS.]

- Rich
-- 
Ah say, son, you're about as sharp as a bowlin' ball.


* Re: NFS BUG_ON in nfs_do_writepage
  2009-04-13  5:46 NFS BUG_ON in nfs_do_writepage rercola
@ 2009-04-13  6:50 ` Andrew Morton
  2009-04-13 19:16   ` Trond Myklebust
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2009-04-13  6:50 UTC (permalink / raw)
  To: rercola; +Cc: linux-kernel, linux-nfs

(cc linux-nfs)

On Mon, 13 Apr 2009 01:46:24 -0400 (EDT) rercola@acm.jhu.edu wrote:

> Hi world,
> I've got a production server that's running as an NFSv4 client, along with 
> a number of other machines.
> 
> All the other machines are perfectly happy, but this one is a bit of a 
> bother. It's got a Core 2 Duo 6700, with a D975XBX2 motherboard and 4 GB 
> of ECC RAM.
> 
> The problem is that, under heavy load, NFS will trip a BUG_ON in 
> nfs_do_writepage, as follows:
> [...]
> 
> 64bit kernel, 32bit userland. 2.6.29.1 vanilla, bug occurred as early as 
> 2.6.28, bug still occurs with 2.6.30-rc1. I'm running bisect now, but 
> there's a limit on how often I can reboot a production server, so I'll 
> report back when I find it.
> 
> The unfortunate part, of course, is that when this bug occurs, the 
> writepage never returns...meaning that the process in question is 
> permanently locked in la-la-land (AKA state D). This renders this 
> unfortunate bug a bit...inconvenient.
> 
> [No other clients, or the server, report anything interesting when this 
> happens, AFAICS.]
> 



* Re: NFS BUG_ON in nfs_do_writepage
  2009-04-13  6:50 ` Andrew Morton
@ 2009-04-13 19:16   ` Trond Myklebust
  2009-04-13 22:06     ` Rince
  0 siblings, 1 reply; 16+ messages in thread
From: Trond Myklebust @ 2009-04-13 19:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: rercola, linux-kernel, linux-nfs

[-- Attachment #1: Type: text/plain, Size: 4612 bytes --]

On Sun, 2009-04-12 at 23:50 -0700, Andrew Morton wrote:
> (cc linux-nfs)
> 
> On Mon, 13 Apr 2009 01:46:24 -0400 (EDT) rercola@acm.jhu.edu wrote:
> 
> > Hi world,
> > I've got a production server that's running as an NFSv4 client, along with 
> > a number of other machines.
> > 
> > All the other machines are perfectly happy, but this one is a bit of a 
> > bother. It's got a Core 2 Duo 6700, with a D975XBX2 motherboard and 4 GB 
> > of ECC RAM.
> > 
> > The problem is that, under heavy load, NFS will trip a BUG_ON in 
> > nfs_do_writepage, as follows:
> > [...]
> > 
> > 64bit kernel, 32bit userland. 2.6.29.1 vanilla, bug occurred as early as 
> > 2.6.28, bug still occurs with 2.6.30-rc1. I'm running bisect now, but 
> > there's a limit on how often I can reboot a production server, so I'll 
> > report back when I find it.

This looks like the same issue as
    http://bugzilla.kernel.org/show_bug.cgi?id=12913

Does the following patch from Nick help? It should apply on top of
2.6.30-rc1, and fixes a race in which the VM layer can end up marking a
page dirty after the filesystem has cleaned it...

Cheers
  Trond


[-- Attachment #2: Attached message - [patch] mm: close page_mkwrite races --]
[-- Type: message/rfc822, Size: 9801 bytes --]

From: Nick Piggin <npiggin@suse.de>
To: Andrew Morton <akpm@linux-foundation.org>, Sage Weil <sage@newdream.net>, Trond Myklebust <trond.myklebust@fys.uio.no>, linux-fsdevel@vger.kernel.org, linux-mm@vger.kernel.org
Subject: [patch] mm: close page_mkwrite races
Date: Mon, 30 Mar 2009 15:53:07 +0200
Message-ID: <20090330135307.GP31000@wotan.suse.de>

Hi,

I'd like opinions on this patch (it applies on top of the previous
page_mkwrite fixes in -mm). I was not going to ask to merge it
immediately, but it appears that fsblock is not the only one that
needs it...

--
I want the page to be protected by the page lock between the
page_mkwrite notification to the filesystem and the actual marking of
the page as dirty. Do this by allowing the filesystem to return a
locked page from page_mkwrite, and by having the page fault code keep
the page lock held until after it calls set_page_dirty.

I need this in fsblock because I am working to ensure filesystem metadata
can be correctly allocated and refcounted. This means that page cleaning
should not require memory allocation.

Without this patch we could, for example, get a concurrent writeout
after the page_mkwrite (which allocates the page metadata required to
clean it) but before the set_page_dirty. The writeout will clean the
page and notice that the metadata is now unused, so it may be
deallocated (the page appears clean because set_page_dirty hasn't been
called yet). At that point the page may be dirtied via the pte without
enough metadata left to write it back.

Sage needs this race closed for ceph, and Trond maybe for NFS.

Cc: Sage Weil <sage@newdream.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Nick Piggin <npiggin@suse.de>

---
 Documentation/filesystems/Locking |   24 +++++++---
 fs/buffer.c                       |   10 ++--
 mm/memory.c                       |   83 ++++++++++++++++++++++++++------------
 3 files changed, 79 insertions(+), 38 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2480,7 +2480,8 @@ block_page_mkwrite(struct vm_area_struct
 	if ((page->mapping != inode->i_mapping) ||
 	    (page_offset(page) > size)) {
 		/* page got truncated out from underneath us */
-		goto out_unlock;
+		unlock_page(page);
+		goto out;
 	}
 
 	/* page is wholly or partially inside EOF */
@@ -2494,14 +2495,15 @@ block_page_mkwrite(struct vm_area_struct
 		ret = block_commit_write(page, 0, end);
 
 	if (unlikely(ret)) {
+		unlock_page(page);
 		if (ret == -ENOMEM)
 			ret = VM_FAULT_OOM;
 		else /* -ENOSPC, -EIO, etc */
 			ret = VM_FAULT_SIGBUS;
-	}
+	} else
+		ret = VM_FAULT_LOCKED;
 
-out_unlock:
-	unlock_page(page);
+out:
 	return ret;
 }
 
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -1964,6 +1964,15 @@ static int do_wp_page(struct mm_struct *
 				ret = tmp;
 				goto unwritable_page;
 			}
+			if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+				lock_page(old_page);
+				if (!old_page->mapping) {
+					ret = 0; /* retry the fault */
+					unlock_page(old_page);
+					goto unwritable_page;
+				}
+			} else
+				VM_BUG_ON(!PageLocked(old_page));
 
 			/*
 			 * Since we dropped the lock we need to revalidate
@@ -1973,9 +1982,11 @@ static int do_wp_page(struct mm_struct *
 			 */
 			page_table = pte_offset_map_lock(mm, pmd, address,
 							 &ptl);
-			page_cache_release(old_page);
-			if (!pte_same(*page_table, orig_pte))
+			if (!pte_same(*page_table, orig_pte)) {
+				page_cache_release(old_page);
+				unlock_page(old_page);
 				goto unlock;
+			}
 
 			page_mkwrite = 1;
 		}
@@ -2098,16 +2109,30 @@ unlock:
 		 *
 		 * do_no_page is protected similarly.
 		 */
-		wait_on_page_locked(dirty_page);
-		set_page_dirty_balance(dirty_page, page_mkwrite);
+		if (!page_mkwrite) {
+			wait_on_page_locked(dirty_page);
+			set_page_dirty_balance(dirty_page, page_mkwrite);
+		}
 		put_page(dirty_page);
+		if (page_mkwrite) {
+			struct address_space *mapping = old_page->mapping;
+
+			unlock_page(old_page);
+			page_cache_release(old_page);
+			balance_dirty_pages_ratelimited(mapping);
+		}
 	}
 	return ret;
 oom_free_new:
 	page_cache_release(new_page);
 oom:
-	if (old_page)
+	if (old_page) {
+		if (page_mkwrite) {
+			unlock_page(old_page);
+			page_cache_release(old_page);
+		}
 		page_cache_release(old_page);
+	}
 	return VM_FAULT_OOM;
 
 unwritable_page:
@@ -2659,27 +2684,22 @@ static int __do_fault(struct mm_struct *
 				int tmp;
 
 				unlock_page(page);
-				vmf.flags |= FAULT_FLAG_MKWRITE;
+				vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 				tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
 				if (unlikely(tmp &
 					  (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
 					ret = tmp;
-					anon = 1; /* no anon but release vmf.page */
-					goto out_unlocked;
-				}
-				lock_page(page);
-				/*
-				 * XXX: this is not quite right (racy vs
-				 * invalidate) to unlock and relock the page
-				 * like this, however a better fix requires
-				 * reworking page_mkwrite locking API, which
-				 * is better done later.
-				 */
-				if (!page->mapping) {
-					ret = 0;
-					anon = 1; /* no anon but release vmf.page */
-					goto out;
+					goto unwritable_page;
 				}
+				if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+					lock_page(page);
+					if (!page->mapping) {
+						ret = 0; /* retry the fault */
+						unlock_page(page);
+						goto unwritable_page;
+					}
+				} else
+					VM_BUG_ON(!PageLocked(page));
 				page_mkwrite = 1;
 			}
 		}
@@ -2731,19 +2751,30 @@ static int __do_fault(struct mm_struct *
 	pte_unmap_unlock(page_table, ptl);
 
 out:
-	unlock_page(vmf.page);
-out_unlocked:
-	if (anon)
-		page_cache_release(vmf.page);
-	else if (dirty_page) {
+	if (dirty_page) {
+		struct address_space *mapping = page->mapping;
+
 		if (vma->vm_file)
 			file_update_time(vma->vm_file);
 
+		if (set_page_dirty(dirty_page))
+			page_mkwrite = 1;
 		set_page_dirty_balance(dirty_page, page_mkwrite);
+		unlock_page(dirty_page);
 		put_page(dirty_page);
+		if (page_mkwrite)
+			balance_dirty_pages_ratelimited(mapping);
+	} else {
+		unlock_page(vmf.page);
+		if (anon)
+			page_cache_release(vmf.page);
 	}
 
 	return ret;
+
+unwritable_page:
+	page_cache_release(page);
+	return ret;
 }
 
 static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
Index: linux-2.6/Documentation/filesystems/Locking
===================================================================
--- linux-2.6.orig/Documentation/filesystems/Locking
+++ linux-2.6/Documentation/filesystems/Locking
@@ -509,16 +509,24 @@ locking rules:
 		BKL	mmap_sem	PageLocked(page)
 open:		no	yes
 close:		no	yes
-fault:		no	yes
-page_mkwrite:	no	yes		no
+fault:		no	yes		can return with page locked
+page_mkwrite:	no	yes		can return with page locked
 access:		no	yes
 
-	->page_mkwrite() is called when a previously read-only page is
-about to become writeable. The file system is responsible for
-protecting against truncate races. Once appropriate action has been
-taking to lock out truncate, the page range should be verified to be
-within i_size. The page mapping should also be checked that it is not
-NULL.
+	->fault() is called when a previously not present pte is about
+to be faulted in. The filesystem must find and return the page associated
+with the passed in "pgoff" in the vm_fault structure. If it is possible that
+the page may be truncated and/or invalidated, then the filesystem must lock
+the page, then ensure it is not already truncated (the page lock will block
+subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
+locked. The VM will unlock the page.
+
+	->page_mkwrite() is called when a previously read-only pte is
+about to become writeable. The filesystem again must ensure that there are
+no truncate/invalidate races, and then return with the page locked. If
+the page has been truncated, the filesystem should not look up a new page
+like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which
+will cause the VM to retry the fault.
 
 	->access() is called when get_user_pages() fails in
 acces_process_vm(), typically used to debug a process through
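
Distilled from the hunks above (block_page_mkwrite() and the new
Locking rules), here is a minimal sketch of a ->page_mkwrite() handler
under the VM_FAULT_LOCKED contract. It is illustrative only:
example_page_mkwrite() and example_prepare_write() are invented names,
not code from this patch.

static int example_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;

	lock_page(page);
	/* The page lock blocks truncate; re-check the mapping under it. */
	if (page->mapping != inode->i_mapping) {
		/* Already truncated: don't look up a new page. */
		unlock_page(page);
		return VM_FAULT_NOPAGE;	/* the VM retries the fault */
	}

	/*
	 * Hypothetical helper: allocate whatever state is needed to
	 * later clean (write back) the page.
	 */
	if (example_prepare_write(page)) {
		unlock_page(page);
		return VM_FAULT_SIGBUS;
	}

	/*
	 * Return with the page still locked. The VM now calls
	 * set_page_dirty() before unlocking, so writeback cannot clean
	 * the page between this notification and the dirtying.
	 */
	return VM_FAULT_LOCKED;
}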


* Re: NFS BUG_ON in nfs_do_writepage
  2009-04-13 19:16   ` Trond Myklebust
@ 2009-04-13 22:06     ` Rince
  2009-04-13 23:44       ` Rince
  0 siblings, 1 reply; 16+ messages in thread
From: Rince @ 2009-04-13 22:06 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Andrew Morton, linux-kernel, linux-nfs

On Mon, Apr 13, 2009 at 3:16 PM, Trond Myklebust
<trond.myklebust@fys.uio.no> wrote:
> On Sun, 2009-04-12 at 23:50 -0700, Andrew Morton wrote:
>> (cc linux-nfs)
>>
>> On Mon, 13 Apr 2009 01:46:24 -0400 (EDT) rercola@acm.jhu.edu wrote:
>>
>> > Hi world,
>> > I've got a production server that's running as an NFSv4 client, along with
>> > a number of other machines.
>> >
>> > All the other machines are perfectly happy, but this one is a bit of a
>> > bother. It's got a Core 2 Duo 6700, with a D975XBX2 motherboard and 4 GB
>> > of ECC RAM.
>> >
>> > The problem is that, under heavy load, NFS will trip a BUG_ON in
>> > nfs_do_writepage, as follows:
>> > [...]
>> >
>> > 64bit kernel, 32bit userland. 2.6.29.1 vanilla, bug occurred as early as
>> > 2.6.28, bug still occurs with 2.6.30-rc1. I'm running bisect now, but
>> > there's a limit on how often I can reboot a production server, so I'll
>> > report back when I find it.
>
> This looks like the same issue as
>    http://bugzilla.kernel.org/show_bug.cgi?id=12913
>
> Does the following patch from Nick help? It should apply on top of
> 2.6.30-rc1, and fixes a race in which the VM layer can end up marking a
> page dirty after the filesystem has cleaned it...
>
> Cheers
>  Trond
>

Patch does not apply cleanly to 2.6.30-rc1. Working on this.

- Rich
-- 

I'm so miserable without you, it's almost like you're here.


* Re: NFS BUG_ON in nfs_do_writepage
  2009-04-13 22:06     ` Rince
@ 2009-04-13 23:44       ` Rince
  2009-04-24  9:26         ` Rince
  0 siblings, 1 reply; 16+ messages in thread
From: Rince @ 2009-04-13 23:44 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Andrew Morton, linux-kernel, linux-nfs

On Mon, Apr 13, 2009 at 6:06 PM, Rince <rincebrain@gmail.com> wrote:
> Patch does not apply cleanly to 2.6.30-rc1. Working on this.

Okay, so I'm having trouble getting this to apply to anything at all.
I can see it requires a previous patch by Nick, but even in the thread
it's mentioned that it doesn't apply cleanly.

Is there a canonical form that I can apply somewhere?

- Rich

PS: I rather expect my report here to be the same as that Bugzilla
bug, since I filed it. :)
-- 

If we were meant to get up early, God would have created us with alarm clocks.


* Re: NFS BUG_ON in nfs_do_writepage
  2009-04-13 23:44       ` Rince
@ 2009-04-24  9:26         ` Rince
  2009-04-24 14:14           ` Trond Myklebust
  2009-04-25 14:57           ` Trond Myklebust
  0 siblings, 2 replies; 16+ messages in thread
From: Rince @ 2009-04-24  9:26 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Andrew Morton, linux-kernel, linux-nfs

Applied try 3 of Nick Piggin's patch to 2.6.30-rc3 (cleanly, no less!)

Doesn't appear to have helped at all - I received my favorite BUG ON
write.c:252 just like always, within 24 hours of booting the kernel,
even.

- Rich
-- 

Keep out of reach of children.


* Re: NFS BUG_ON in nfs_do_writepage
  2009-04-24  9:26         ` Rince
@ 2009-04-24 14:14           ` Trond Myklebust
  2009-04-25 14:57           ` Trond Myklebust
  1 sibling, 0 replies; 16+ messages in thread
From: Trond Myklebust @ 2009-04-24 14:14 UTC (permalink / raw)
  To: Rince; +Cc: Andrew Morton, linux-kernel, linux-nfs

On Fri, 2009-04-24 at 05:26 -0400, Rince wrote:
> Applied try 3 of Nick Piggin's patch to 2.6.30-rc3 (cleanly, no less!)
> 
> Doesn't appear to have helped at all - I received my favorite BUG ON
> write.c:252 just like always, within 24 hours of booting the kernel,
> even.
> 
> - Rich

Yes. I'm able to reproduce this bug now. I found that I can trigger it
fairly reliably by running iozone in something like the following
configuration: "iozone -t 32 -r 1k -s 256m -c -D -K -Z -i0 -i8"

I believe it boils down to the problems being discussed in the following
thread:
   http://thread.gmane.org/gmane.linux.nfs/26181

Cheers
  Trond



* Re: NFS BUG_ON in nfs_do_writepage
  2009-04-24  9:26         ` Rince
  2009-04-24 14:14           ` Trond Myklebust
@ 2009-04-25 14:57           ` Trond Myklebust
  2009-04-26  6:40             ` Nick Piggin
  1 sibling, 1 reply; 16+ messages in thread
From: Trond Myklebust @ 2009-04-25 14:57 UTC (permalink / raw)
  To: Rince, Nick Piggin; +Cc: Andrew Morton, linux-kernel, linux-nfs

On Fri, 2009-04-24 at 05:26 -0400, Rince wrote:
> Applied try 3 of Nick Piggin's patch to 2.6.30-rc3 (cleanly, no less!)
> 
> Doesn't appear to have helped at all - I received my favorite BUG ON
> write.c:252 just like always, within 24 hours of booting the kernel,
> even.

Can you apply the following incremental patch on top of Nick's. This
appears to suffice to close the race on my setup.

Cheers
  Trond
---------------------------------------------------------------------
From f0258852dcb43c748854d2ee550c9c270bb25f21 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Fri, 24 Apr 2009 17:32:22 -0400
Subject: [PATCH] NFS: Close page_mkwrite() races

Follow up to Nick Piggin's patches to ensure that nfs_vm_page_mkwrite
returns with the page lock held, and sets the VM_FAULT_LOCKED flag.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
 fs/nfs/file.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5a97bcf..ec7e27d 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -517,10 +517,10 @@ static int nfs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 	ret = nfs_updatepage(filp, page, 0, pagelen);
 out_unlock:
+	if (!ret)
+		return VM_FAULT_LOCKED;
 	unlock_page(page);
-	if (ret)
-		ret = VM_FAULT_SIGBUS;
-	return ret;
+	return VM_FAULT_SIGBUS;
 }
 
 static struct vm_operations_struct nfs_file_vm_ops = {
-- 
1.6.0.6





* Re: NFS BUG_ON in nfs_do_writepage
  2009-04-25 14:57           ` Trond Myklebust
@ 2009-04-26  6:40             ` Nick Piggin
  2009-04-26 14:18               ` Trond Myklebust
  0 siblings, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2009-04-26  6:40 UTC (permalink / raw)
  To: Trond Myklebust, linux-fsdevel
  Cc: Rince, Andrew Morton, linux-kernel, linux-nfs

On Sat, Apr 25, 2009 at 10:57:08AM -0400, Trond Myklebust wrote:
> On Fri, 2009-04-24 at 05:26 -0400, Rince wrote:
> > Applied try 3 of Nick Piggin's patch to 2.6.30-rc3 (cleanly, no less!)
> > 
> > Doesn't appear to have helped at all - I received my favorite BUG ON
> > write.c:252 just like always, within 24 hours of booting the kernel,
> > even.
> 
> Can you apply the following incremental patch on top of Nick's. This
> appears to suffice to close the race on my setup.

Thanks, yes that looks good. Note: I deliberately didn't try to
convert filesystems because it needs much better understanding
of each one. So any fs maintainers using page_mkwrite I hope have
looked at these patches and considered whether they need to
do anything differently (ditto for the page_mkwrite return value
fixup patch).

Thanks,
Nick



* Re: NFS BUG_ON in nfs_do_writepage
  2009-04-26  6:40             ` Nick Piggin
@ 2009-04-26 14:18               ` Trond Myklebust
  2009-04-26 15:13                 ` Nick Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Trond Myklebust @ 2009-04-26 14:18 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, Rince, Andrew Morton, linux-kernel, linux-nfs

On Sun, 2009-04-26 at 08:40 +0200, Nick Piggin wrote:
> On Sat, Apr 25, 2009 at 10:57:08AM -0400, Trond Myklebust wrote:
> > On Fri, 2009-04-24 at 05:26 -0400, Rince wrote:
> > > Applied try 3 of Nick Piggin's patch to 2.6.30-rc3 (cleanly, no less!)
> > > 
> > > Doesn't appear to have helped at all - I received my favorite BUG ON
> > > write.c:252 just like always, within 24 hours of booting the kernel,
> > > even.
> > 
> > Can you apply the following incremental patch on top of Nick's. This
> > appears to suffice to close the race on my setup.
> 
> Thanks, yes that looks good. Note: I deliberately didn't try to
> convert filesystems because it needs much better understanding
> of each one. So any fs maintainers using page_mkwrite I hope have
> looked at these patches and considered whether they need to
> do anything differently (ditto for the page_mkwrite return value
> fixup patch).

Note that after applying this, I put a WARN_ON(!PageDirty()) in the NFS
set_page_dirty() method, and ran some mmap stress tests.

As far as I can tell, the WARN_ON is never triggering. I take this to
mean that the remaining cases where the VM is calling set_page_dirty()
are basically all cases where we've already seen a page fault and set
the page dirty flag, but haven't yet written it out (i.e. we haven't yet
called clear_page_dirty_for_io() and so the pte is still marked as
dirty).
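
The instrumentation described was presumably something along these
lines (a hedged reconstruction; the exact code is not shown in the
thread):

static int nfs_set_page_dirty_debug(struct page *page)
{
	/* Warn if the VM, rather than nfs_vm_page_mkwrite()/
	 * nfs_updatepage(), would be the first to dirty this page. */
	WARN_ON(!PageDirty(page));
	return __set_page_dirty_nobuffers(page);
}
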
That again implies that set_page_dirty() is now fully redundant for
filesystems that define page_mkwrite(), provided that the filesystem
takes charge of marking the page as dirty.

This suggests an alternative fix for the stable kernels in the form of
the following patch.
Comments?

Cheers
  Trond
------------------------------------------------------------------
commit 684049bf73059aa9be35f9cdf07acda29ebb0340
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date:   Sun Apr 26 10:14:34 2009 -0400

    NFS: Fix page dirtying races in NFS
    
    If a filesystem defines a page_mkwrite() callback that also marks the page
    as being dirty, then we don't need to define a set_page_dirty() callback.
    
    The following patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=12913
    by eliminating a race in do_wp_page() and __do_fault(). The latter two
    mark the page as dirty after the call to page_mkwrite(). Since
    nfs_vm_page_mkwrite() has already marked the page as dirty, this means that
    there is a race whereby the filesystem may actually have cleaned the
    page by the time it is marked as dirty (again) by the VM.
    
    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
---
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5a97bcf..21bffaf 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -465,10 +465,19 @@ static int nfs_launder_page(struct page *page)
 	return nfs_wb_page(inode, page);
 }
 
+static int nfs_set_page_dirty(struct page *page)
+{
+	/* We don't need to have the VM mark the page as dirty, since
+	 * nfs_updatepage() will do it. This eliminates the race
+	 * that caused http://bugzilla.kernel.org/show_bug.cgi?id=12913
+	 */
+	return 0;
+}
+
 const struct address_space_operations nfs_file_aops = {
 	.readpage = nfs_readpage,
 	.readpages = nfs_readpages,
-	.set_page_dirty = __set_page_dirty_nobuffers,
+	.set_page_dirty = nfs_set_page_dirty,
 	.writepage = nfs_writepage,
 	.writepages = nfs_writepages,
 	.write_begin = nfs_write_begin,




* Re: NFS BUG_ON in nfs_do_writepage
  2009-04-26 14:18               ` Trond Myklebust
@ 2009-04-26 15:13                 ` Nick Piggin
  2009-04-26 17:55                   ` Trond Myklebust
  0 siblings, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2009-04-26 15:13 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-fsdevel, Rince, Andrew Morton, linux-kernel, linux-nfs

On Sun, Apr 26, 2009 at 10:18:29AM -0400, Trond Myklebust wrote:
> On Sun, 2009-04-26 at 08:40 +0200, Nick Piggin wrote:
> > On Sat, Apr 25, 2009 at 10:57:08AM -0400, Trond Myklebust wrote:
> > > On Fri, 2009-04-24 at 05:26 -0400, Rince wrote:
> > > > Applied try 3 of Nick Piggin's patch to 2.6.30-rc3 (cleanly, no less!)
> > > > 
> > > > Doesn't appear to have helped at all - I received my favorite BUG ON
> > > > write.c:252 just like always, within 24 hours of booting the kernel,
> > > > even.
> > > 
> > > Can you apply the following incremental patch on top of Nick's. This
> > > appears to suffice to close the race on my setup.
> > 
> > Thanks, yes that looks good. Note: I deliberately didn't try to
> > convert filesystems because it needs much better understanding
> > of each one. So any fs maintainers using page_mkwrite I hope have
> > looked at these patches and considered whether they need to
> > do anything differently (ditto for the page_mkwrite return value
> > fixup patch).
> 
> Note that after applying this, I put a WARN_ON(!PageDirty()) in the NFS
> set_page_dirty() method, and ran some mmap stress tests.
> 
> As far as I can tell, the WARN_ON is never triggering. I take this to
> mean that the remaining cases where the VM is calling set_page_dirty()
> are basically all cases where we've already seen a page fault and set
> the page dirty flag, but haven't yet written it out (i.e. we haven't yet
> called clear_page_dirty_for_io() and so the pte is still marked as
> dirty).
> That again implies that set_page_dirty() is now fully redundant for
> filesystems that define page_mkwrite(), provided that the filesystem
> takes charge of marking the page as dirty.
> 
> This suggests an alternative fix for the stable kernels in the form of
> the following patch.
> Comments?

This doesn't seem to fix the race, though... on kernels with the
race still there, it will just open a window where you can have
a dirty pte but the page not written out.

I don't understand.



* Re: NFS BUG_ON in nfs_do_writepage
  2009-04-26 15:13                 ` Nick Piggin
@ 2009-04-26 17:55                   ` Trond Myklebust
  2009-04-28  4:27                     ` Nick Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Trond Myklebust @ 2009-04-26 17:55 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, Rince, Andrew Morton, linux-kernel, linux-nfs

On Sun, 2009-04-26 at 17:13 +0200, Nick Piggin wrote:
> On Sun, Apr 26, 2009 at 10:18:29AM -0400, Trond Myklebust wrote:
> > On Sun, 2009-04-26 at 08:40 +0200, Nick Piggin wrote:
> > > On Sat, Apr 25, 2009 at 10:57:08AM -0400, Trond Myklebust wrote:
> > > > On Fri, 2009-04-24 at 05:26 -0400, Rince wrote:
> > > > > Applied try 3 of Nick Piggin's patch to 2.6.30-rc3 (cleanly, no less!)
> > > > > 
> > > > > Doesn't appear to have helped at all - I received my favorite BUG ON
> > > > > write.c:252 just like always, within 24 hours of booting the kernel,
> > > > > even.
> > > > 
> > > > Can you apply the following incremental patch on top of Nick's. This
> > > > appears to suffice to close the race on my setup.
> > > 
> > > Thanks, yes that looks good. Note: I deliberately didn't try to
> > > convert filesystems because it needs much better understanding
> > > of each one. So any fs maintainers using page_mkwrite I hope have
> > > looked at these patches and considered whether they need to
> > > do anything differently (ditto for the page_mkwrite return value
> > > fixup patch).
> > 
> > Note that after applying this, I put a WARN_ON(!PageDirty()) in the NFS
> > set_page_dirty() method, and ran some mmap stress tests.
> > 
> > As far as I can tell, the WARN_ON is never triggering. I take this to
> > mean that the remaining cases where the VM is calling set_page_dirty()
> > are basically all cases where we've already seen a page fault and set
> > the page dirty flag, but haven't yet written it out (i.e. we haven't yet
> > called clear_page_dirty_for_io() and so the pte is still marked as
> > dirty).
> > That again implies that set_page_dirty() is now fully redundant for
> > filesystems that define page_mkwrite(), provided that the filesystem
> > takes charge of marking the page as dirty.
> > 
> > This suggests an alternative fix for the stable kernels in the form of
> > the following patch.
> > Comments?
> 
> This doesn't seem to fix the race, though... on kernels with the
> race still there, it will just open a window where you can have
> a dirty pte but the page not written out.
> 
> I don't understand.

I'm just pointing out that the NFS client already calls
__set_page_dirty_nobuffers() while holding the page lock inside the
nfs_vm_page_mkwrite() call, so having the VM do it too in the call to
set_page_dirty_balance() is actually redundant. IOW: as far as the NFS
code is concerned, we can get rid of the ->set_page_dirty() callback in
that situation.

I couldn't find any other places in the VM code where we can have a
dirty pte without also having called page_mkwrite() (and hence
__set_page_dirty_nobuffers). As I said, adding a WARN_ON(!PageDirty())
in ->set_page_dirty() didn't ever trigger any cases where the
set_page_dirty() was actually setting the dirty bit (except in the case
where we race with page writeout in do_wp_page() and __do_fault()).

That's why I believe disabling ->set_page_dirty() is safe here, and will
in fact suffice to fix the page writeout race.

Cheers
  Trond



* Re: NFS BUG_ON in nfs_do_writepage
  2009-04-26 17:55                   ` Trond Myklebust
@ 2009-04-28  4:27                     ` Nick Piggin
  2009-04-28 11:45                       ` Trond Myklebust
  0 siblings, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2009-04-28  4:27 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-fsdevel, Rince, Andrew Morton, linux-kernel, linux-nfs

On Sun, Apr 26, 2009 at 01:55:22PM -0400, Trond Myklebust wrote:
> On Sun, 2009-04-26 at 17:13 +0200, Nick Piggin wrote:
> > This doesn't seem to fix the race, though... on kernels with the
> > race still there, it will just open a window where you can have
> > a dirty pte but the page not written out.
> > 
> > I don't understand.
> 
> I'm just pointing out that the NFS client already calls
> __set_page_dirty_nobuffers() while holding the page lock inside the
> nfs_vm_page_mkwrite() call, so having the VM do it too in the call to
> set_page_dirty_balance() is actually redundant. IOW: as far as the NFS
> code is concerned, we can get rid of the ->set_page_dirty() callback in
> that situation.
> 
> I couldn't find any other places in the VM code where we can have a
> dirty pte without also having called page_mkwrite() (and hence
> __set_page_dirty_nobuffers). As I said, adding a WARN_ON(!PageDirty())
> in ->set_page_dirty() didn't ever trigger any cases where the
> set_page_dirty() was actually setting the dirty bit (except in the case
> where we race with page writeout in do_wp_page() and __do_fault()).
> 
> That's why I believe disabling ->set_page_dirty() is safe here, and will
> in fact suffice to fix the page writeout race.

Ah, no I don't think so because it opens another race where the
pte is dirty but the page is marked clean.



* Re: NFS BUG_ON in nfs_do_writepage
  2009-04-28  4:27                     ` Nick Piggin
@ 2009-04-28 11:45                       ` Trond Myklebust
  2009-04-28 11:54                         ` Nick Piggin
  0 siblings, 1 reply; 16+ messages in thread
From: Trond Myklebust @ 2009-04-28 11:45 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, Rince, Andrew Morton, linux-kernel, linux-nfs

On Tue, 2009-04-28 at 06:27 +0200, Nick Piggin wrote:
> On Sun, Apr 26, 2009 at 01:55:22PM -0400, Trond Myklebust wrote:
> > On Sun, 2009-04-26 at 17:13 +0200, Nick Piggin wrote:
> > > This doesn't seem to fix the race, though... on kernels with the
> > > race still there, it will just open a window where you can have
> > > a dirty pte but the page not written out.
> > > 
> > > I don't understand.
> > 
> > I'm just pointing out that the NFS client already calls
> > __set_page_dirty_nobuffers() while holding the page lock inside the
> > nfs_vm_page_mkwrite() call, so having the VM do it too in the call to
> > set_page_dirty_balance() is actually redundant. IOW: as far as the NFS
> > code is concerned, we can get rid of the ->set_page_dirty() callback in
> > that situation.
> > 
> > I couldn't find any other places in the VM code where we can have a
> > dirty pte without also having called page_mkwrite() (and hence
> > __set_page_dirty_nobuffers). As I said, adding a WARN_ON(!PageDirty())
> > in ->set_page_dirty() didn't ever trigger any cases where the
> > set_page_dirty() was actually setting the dirty bit (except in the case
> > where we race with page writeout in do_wp_page() and __do_fault()).
> > 
> > That's why I believe disabling ->set_page_dirty() is safe here, and will
> > in fact suffice to fix the page writeout race.
> 
> Ah, no I don't think so because it opens another race where the
> pte is dity but the page is marked clean.

So how can that happen?

AFAICS, when the pte is dirtied, we should get a page fault, which
causes the page itself to be marked dirty by the nfs_vm_page_mkwrite()
callback.
When the page gets written out, the VM calls clear_page_dirty_for_io()
which also causes the pte to be cleaned.

At what point can you therefore have a situation where the pte is dirty
without the page being marked as dirty too?




* Re: NFS BUG_ON in nfs_do_writepage
  2009-04-28 11:45                       ` Trond Myklebust
@ 2009-04-28 11:54                         ` Nick Piggin
  2009-04-28 11:59                           ` Trond Myklebust
  0 siblings, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2009-04-28 11:54 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: linux-fsdevel, Rince, Andrew Morton, linux-kernel, linux-nfs

On Tue, Apr 28, 2009 at 07:45:17AM -0400, Trond Myklebust wrote:
> On Tue, 2009-04-28 at 06:27 +0200, Nick Piggin wrote:
> > On Sun, Apr 26, 2009 at 01:55:22PM -0400, Trond Myklebust wrote:
> > > On Sun, 2009-04-26 at 17:13 +0200, Nick Piggin wrote:
> > > > This doesn't seem to fix the race, though... on kernels with the
> > > > race still there, it will just open a window where you can have
> > > > a dirty pte but the page not written out.
> > > > 
> > > > I don't understand.
> > > 
> > > I'm just pointing out that the NFS client already calls
> > > __set_page_dirty_nobuffers() while holding the page lock inside the
> > > nfs_vm_page_mkwrite() call, so having the VM do it too in the call to
> > > set_page_dirty_balance() is actually redundant. IOW: as far as the NFS
> > > code is concerned, we can get rid of the ->set_page_dirty() callback in
> > > that situation.
> > > 
> > > I couldn't find any other places in the VM code where we can have a
> > > dirty pte without also having called page_mkwrite() (and hence
> > > __set_page_dirty_nobuffers). As I said, adding a WARN_ON(!PageDirty())
> > > in ->set_page_dirty() didn't ever trigger any cases where the
> > > set_page_dirty() was actually setting the dirty bit (except in the case
> > > where we race with page writeout in do_wp_page() and __do_fault()).
> > > 
> > > That's why I believe disabling ->set_page_dirty() is safe here, and will
> > > in fact suffice to fix the page writeout race.
> > 
> > Ah, no I don't think so because it opens another race where the
> > pte is dirty but the page is marked clean.
> 
> So how can that happen?

If the page gets cleaned after page_mkwrite and before the page
table locks are taken again in order to set the pte writeable.
(actually, page_mkclean only runs if it finds mapcount elevated,
so it is enough to clean the page even after the locks are taken
and before mapcount is incremented in the case of __do_fault).
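
A toy userspace analogy of that window (an illustration only, not
kernel code): if the "fault" side drops the page lock before the
pte-side dirtying happens, a concurrent "cleaner" can run in the gap,
ending with a dirty pte but a clean page.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;
static bool page_dirty, pte_dirty;

static void *fault_path(void *unused)
{
	pthread_mutex_lock(&page_lock);
	page_dirty = true;		  /* page_mkwrite() dirties the page */
	pthread_mutex_unlock(&page_lock); /* lock dropped too early... */
	pte_dirty = true;		  /* ...pte made writable+dirty later */
	return NULL;
}

static void *writeback_path(void *unused)
{
	pthread_mutex_lock(&page_lock);
	page_dirty = false;	/* clear_page_dirty_for_io() + writepage() */
	pthread_mutex_unlock(&page_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, fault_path, NULL);
	pthread_create(&b, NULL, writeback_path, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* Depending on scheduling, the lost-update state is reachable: */
	if (pte_dirty && !page_dirty)
		printf("pte dirty but page clean\n");
	return 0;
}

Build with cc -pthread; repeated runs show the outcome is
scheduling-dependent, which is why the fix holds the page lock across
both steps.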

 


* Re: NFS BUG_ON in nfs_do_writepage
  2009-04-28 11:54                         ` Nick Piggin
@ 2009-04-28 11:59                           ` Trond Myklebust
  0 siblings, 0 replies; 16+ messages in thread
From: Trond Myklebust @ 2009-04-28 11:59 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, Rince, Andrew Morton, linux-kernel, linux-nfs

On Tue, 2009-04-28 at 13:54 +0200, Nick Piggin wrote:
> If the page gets cleaned after page_mkwrite and before the page
> table locks are taken again in order to set the pte writeable.
> (actually, page_mkclean only runs if it finds mapcount elevated,
> so it is enough to clean the page even after the locks are taken
> and before mapcount is incremented in the case of __do_fault).

OK. This was what I was missing...

Thanks!

