From: Mel Gorman <mgorman@techsingularity.net>
To: Jerome Marchand <jmarchan@redhat.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>,
	Anna Schumaker <anna.schumaker@netapp.com>,
	Christoph Hellwig <hch@infradead.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@suse.de>
Subject: Re: [RFC PATCH] nfs: avoid swap-over-NFS deadlock
Date: Thu, 20 Aug 2015 13:23:59 +0100
Message-ID: <20150820122359.GB12432@techsingularity.net>
In-Reply-To: <55B6153B.1070604@redhat.com>

On Mon, Jul 27, 2015 at 01:25:47PM +0200, Jerome Marchand wrote:
> On 07/27/2015 12:52 PM, Mel Gorman wrote:
> > On Wed, Jul 22, 2015 at 03:46:16PM +0200, Jerome Marchand wrote:
> >> On 07/22/2015 02:23 PM, Trond Myklebust wrote:
> >>> On Wed, Jul 22, 2015 at 4:10 AM, Jerome Marchand <jmarchan@redhat.com> wrote:
> >>>>
> >>>> Lockdep warns about an inconsistent {RECLAIM_FS-ON-W} ->
> >>>> {IN-RECLAIM_FS-W} usage. The culprit is the inode->i_mutex taken in
> >>>> nfs_file_direct_write(). This code was introduced by commit a9ab5e840669
> >>>> ("nfs: page cache invalidation for dio").
> >>>> This naive test patch avoids taking the mutex on a swapfile and makes
> >>>> lockdep happy again. However, I don't know much about the NFS code and I
> >>>> assume it's probably not the proper solution. Any thoughts?
> >>>>
> >>>> Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
> >>>
> >>> NFS is not the only O_DIRECT implementation to set the inode->i_mutex.
> >>> Why can't this be fixed in the generic swap code instead of adding
> >>> yet-another-exception-for-IS_SWAPFILE?
> >>
> >> I meant to cc Mel. Just added him.
> >>
> > 
> > Can the full lockdep warning be included? It would make it easier to see
> > whether the generic swap code can somehow special-case this. Currently,
> > generic swapping does not need to care about how the filesystem handles
> > locking. For most filesystems, it's writing directly to the blocks on disk
> > and bypassing the FS. In the NFS case it'd be surprising to find that there
> > are also dirty pages in the page cache that belong to the swap file, as
> > that is going to cause corruption. If there is any special casing, it would
> > be to only attempt the invalidation in the !swap case and warn if
> > mapping->nrpages is non-zero. It would still look a bit weird but would be
> > safer than just not acquiring the mutex and then potentially attempting an
> > invalidation.
> > 
> 
> [ 6819.501009] =================================
> [ 6819.501009] [ INFO: inconsistent lock state ]
> [ 6819.501009] 4.2.0-rc1-shmacct-babka-v2-next-20150709+ #255 Not tainted
> [ 6819.501009] ---------------------------------

Thanks. Sorry for the long delay but I finally got back to the bug this
week. NFS can be modified to special case the swapfile but I was not happy
with the result for multiple reasons. It took me a while to see a way for
the core VM to deal with it. What do you think of the following
approach? More importantly, does it work for you?

---8<---
nfs: Use swap_lock to prevent parallel swapon activations

Jerome Marchand reported the following lockdep warning:

    [ 6819.501009] =================================
    [ 6819.501009] [ INFO: inconsistent lock state ]
    [ 6819.501009] 4.2.0-rc1-shmacct-babka-v2-next-20150709+ #255 Not tainted
    [ 6819.501009] ---------------------------------
    [ 6819.501009] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    [ 6819.501009] kswapd0/38 [HC0[0]:SC0[0]:HE1:SE1] takes:
    [ 6819.501009]  (&sb->s_type->i_mutex_key#17){+.+.?.}, at: [<ffffffffa03772a5>] nfs_file_direct_write+0x85/0x3f0 [nfs]
    [ 6819.501009] {RECLAIM_FS-ON-W} state was registered at:
    [ 6819.501009]   [<ffffffff81107f51>] mark_held_locks+0x71/0x90
    [ 6819.501009]   [<ffffffff8110b775>] lockdep_trace_alloc+0x75/0xe0
    [ 6819.501009]   [<ffffffff81245529>] kmem_cache_alloc_node_trace+0x39/0x440
    [ 6819.501009]   [<ffffffff81225b8f>] __get_vm_area_node+0x7f/0x160
    [ 6819.501009]   [<ffffffff81226eb2>] __vmalloc_node_range+0x72/0x2c0
    [ 6819.501009]   [<ffffffff81227424>] vzalloc+0x54/0x60
    [ 6819.501009]   [<ffffffff8122c7c8>] SyS_swapon+0x628/0xfc0
    [ 6819.501009]   [<ffffffff81867772>] entry_SYSCALL_64_fastpath+0x12/0x76

The warning arises because, since commit a9ab5e840669 ("nfs: page cache
invalidation for dio"), NFS acquires i_mutex to invalidate the page cache
before direct I/O. Filesystems may safely acquire i_mutex during direct
writes but NFS is unique in its treatment of swap files. Ordinarily swap
files are supported by the core VM looking up the physical block for a given
offset in advance. There is no physical block for NFS, so the direct write
paths are used after calling mapping->swap_activate.

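The lockdep warning is triggered because swapon(), which is not in reclaim
context, acquires i_mutex to ensure a swapfile is not activated twice and
allocates memory (vzalloc) while holding it, whereas kswapd takes the same
mutex from reclaim context in nfs_file_direct_write().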

swapon does not need i_mutex for this purpose. There is a requirement
that fallocate not be used on swapfiles, but that is enforced by the inode
flag S_SWAPFILE and has nothing to do with i_mutex. In fact, the current
protection does nothing for block devices. This patch expands the role
of swap_lock to protect against parallel activations of block devices and
swapfiles and removes the use of i_mutex. This both improves the protection
for swapon and avoids the lockdep warning.

Reported-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/swapfile.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 41e4581af7c5..d58ed6833fa3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1928,9 +1928,9 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 		set_blocksize(bdev, old_block_size);
 		blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
 	} else {
-		mutex_lock(&inode->i_mutex);
+		spin_lock(&swap_lock);
 		inode->i_flags &= ~S_SWAPFILE;
-		mutex_unlock(&inode->i_mutex);
+		spin_unlock(&swap_lock);
 	}
 	filp_close(swap_file, NULL);
 
@@ -2156,7 +2156,6 @@ static int claim_swapfile(struct swap_info_struct *p, struct inode *inode)
 		p->flags |= SWP_BLKDEV;
 	} else if (S_ISREG(inode->i_mode)) {
 		p->bdev = inode->i_sb->s_bdev;
-		mutex_lock(&inode->i_mutex);
 		if (IS_SWAPFILE(inode))
 			return -EBUSY;
 	} else
@@ -2386,6 +2385,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		goto bad_swap;
 	}
 
+	/* prevent parallel swapons */
+	spin_lock(&swap_lock);
 	p->swap_file = swap_file;
 	mapping = swap_file->f_mapping;
 
@@ -2396,13 +2397,14 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 			continue;
 		if (mapping == q->swap_file->f_mapping) {
 			error = -EBUSY;
+			spin_unlock(&swap_lock);
 			goto bad_swap;
 		}
 	}
 
 	inode = mapping->host;
-	/* If S_ISREG(inode->i_mode) will do mutex_lock(&inode->i_mutex); */
 	error = claim_swapfile(p, inode);
+	spin_unlock(&swap_lock);
 	if (unlikely(error))
 		goto bad_swap;
 
@@ -2543,10 +2545,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	vfree(swap_map);
 	vfree(cluster_info);
 	if (swap_file) {
-		if (inode && S_ISREG(inode->i_mode)) {
-			mutex_unlock(&inode->i_mutex);
+		if (inode && S_ISREG(inode->i_mode))
 			inode = NULL;
-		}
 		filp_close(swap_file, NULL);
 	}
 out:
@@ -2556,8 +2556,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	}
 	if (name)
 		putname(name);
-	if (inode && S_ISREG(inode->i_mode))
-		mutex_unlock(&inode->i_mutex);
 	return error;
 }
 
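
For readers unfamiliar with the report format, the state conflict in the
warning can be modelled with a toy userspace sketch. This is not how lockdep
is implemented; the struct and function names below are hypothetical and
exist only to show why the two recorded usages of one lock class clash: a
task that holds the lock and allocates may enter reclaim, and reclaim may
then try to take the same lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-lock-class usage bits, loosely modelled on the report. */
struct lock_class_usage {
	bool held_over_reclaim_alloc;	/* cf. {RECLAIM_FS-ON-W} */
	bool taken_in_reclaim;		/* cf. {IN-RECLAIM_FS-W} */
};

/* The lock was held while an allocation that may enter FS reclaim ran. */
static void note_alloc_while_held(struct lock_class_usage *u)
{
	assert(u != NULL);
	u->held_over_reclaim_alloc = true;
}

/* The lock was acquired from within reclaim context (e.g. by kswapd). */
static void note_acquire_in_reclaim(struct lock_class_usage *u)
{
	assert(u != NULL);
	u->taken_in_reclaim = true;
}

/*
 * Either usage alone is fine. Both together mean a holder could recurse
 * into reclaim and block on the lock it already holds.
 */
static bool usage_inconsistent(const struct lock_class_usage *u)
{
	assert(u != NULL);
	return u->held_over_reclaim_alloc && u->taken_in_reclaim;
}
```

In the trace above, the first bit corresponds to swapon() calling vzalloc()
with i_mutex held, and the second to kswapd0 taking i_mutex in
nfs_file_direct_write().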

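The idea of doing the duplicate check and the claim under one lock can be
sketched in userspace as follows. This is a minimal illustration under
stated assumptions, not the kernel code: struct area, area_activate() and
area_deactivate() are hypothetical names, and a C11 atomic_flag spin loop
stands in for swap_lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_AREAS 8

/* Hypothetical stand-in for struct swap_info_struct. */
struct area {
	bool in_use;
	const void *mapping;	/* identity of the backing file or device */
};

static struct area areas[MAX_AREAS];
static atomic_flag area_lock = ATOMIC_FLAG_INIT;	/* plays swap_lock */

static void area_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&area_lock,
						 memory_order_acquire))
		;	/* spin until the flag is released */
}

static void area_lock_release(void)
{
	atomic_flag_clear_explicit(&area_lock, memory_order_release);
}

/*
 * Claim a slot for @mapping. The duplicate check and the claim happen in
 * one critical section, so two racing callers cannot both succeed for the
 * same mapping. Returns the slot index, -EBUSY if the mapping is already
 * active, or -ENOSPC if no slot is free.
 */
static int area_activate(const void *mapping)
{
	int i, free_slot = -1, ret;

	area_lock_acquire();
	for (i = 0; i < MAX_AREAS; i++) {
		if (areas[i].in_use && areas[i].mapping == mapping) {
			ret = -EBUSY;
			goto out;
		}
		if (!areas[i].in_use && free_slot < 0)
			free_slot = i;
	}
	if (free_slot < 0) {
		ret = -ENOSPC;
		goto out;
	}
	areas[free_slot].in_use = true;
	areas[free_slot].mapping = mapping;
	ret = free_slot;
out:
	area_lock_release();
	return ret;
}

/* Release a slot under the same lock that guards activation. */
static void area_deactivate(int slot)
{
	assert(slot >= 0 && slot < MAX_AREAS);
	area_lock_acquire();
	areas[slot].in_use = false;
	areas[slot].mapping = NULL;
	area_lock_release();
}
```

Only the lookup and the flag update sit inside the critical section in this
sketch; anything that may block would have to happen outside a spinlock.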