Linux-XFS Archive on lore.kernel.org
 help / color / Atom feed
From: "Luis R. Rodriguez" <mcgrof@kernel.org>
To: linux-xfs@vger.kernel.org
Cc: Nikolay Borisov <nborisov@suse.com>, Jan Kara <jack@suse.cz>,
	Michal Hocko <mhocko@suse.com>
Subject: Reviewing determinism of xfs_reclaim_inodes_ag()
Date: Tue, 25 Apr 2017 10:25:03 +0200
Message-ID: <20170425082503.GN28800@wotan.suse.de> (raw)

I've been staring at xfs_reclaim_inodes_ag() for a bit now to determine
if and how its deterministic it is. Fortunately the code makes it relatively
clear it chugs on 32 xfs_inodes at at a time before cond_resched()'ing. I was
concerned originally if we cond_resched() even if we goto restart but it seems
we do.

While reviewing this however I ran into the following question based on a
comment on the routine. Is such a thing as the below change needed ?

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 3531f8f72fa5..05f3c79b4f11 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1157,7 +1157,9 @@ xfs_reclaim_inodes_ag(
 			for (i = 0; i < nr_found; i++) {
 				struct xfs_inode *ip = batch[i];
 
-				if (done || xfs_reclaim_inode_grab(ip, flags))
+				if (done ||
+				    XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno ||
+				    xfs_reclaim_inode_grab(ip, flags))
 					batch[i] = NULL;
 
 				/*

This is what the code looks like before but pay special attention to the
comment and a race with RCU:

                                if (done || xfs_reclaim_inode_grab(ip, flags))                                                                                                  
                                        batch[i] = NULL;                                                                                                                        
                                                                                                                                                                                
                                /*                                                                                                                                              
                                 * Update the index for the next lookup.                                                                                                        
                                 * Catch overflows into the next AG range                                                                                                       
                                 * which can occur if we have inodes in the                                                                                                     
                                 * last block of the AG and we are                                                                                                              
                                 * currently pointing to the last inode.                                                                                                        
                                 *                                                                                                                                              
                                 * Because we may see inodes that are from                                                                                                      
                                 * the wrong AG due to RCU freeing and                                                                                                          
                                 * reallocation, only update the index if                                                                                                       
                                 * it lies in this AG. It was a race that                                                                                                       
                                 * lead us to see this inode, so another                                                                                                        
                                 * lookup from                                                                                                                                  
                                 * the same index will not find it again.                                                                                                       
                                 */                                                                                                                                             
                                if (XFS_INO_TO_AGNO(mp, ip->i_ino) !=                                                                                                           
                                                                pag->pag_agno)                                                                                                  
                                        continue;                                                                                                                               
                                first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1);                                                                                              
                                if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino))                                                                                              
                                        done = 1;  

This leads me to believe that without the above change we might be doing an
xfs_reclaim_inode_grab() and subsequent reclaim for an AG which does not match
the pag given xfs_reclaim_inode_grab() does not seem to check the pag matches.

a) Its unclear if the xfs_reclaim_inode() will find the inode after the
first xfs_reclaim_inode_grab() as its a race.
                                                                                                                                                                                
b) Its unclear what the implications of trying xfs_reclaim_inode_grab() followed
by xfs_reclaim_inode() without holding the correct pag mutex could be.

I checked with Jan Kara and he believes the current code is correct but that
its the comment that that may be misleading. As per Jan the race is between
getting an inode reclaimed and grabbing it. Ie, XFS frees the inodes by RCU.
However it doesn't actually *reuse* the inode until RCU period passes
(unlike inodes allocated from slab with SLAB_RCU can be). So it can happen
that inode we have (ip) is actually already reclaimed by the time we get to
xfs_reclaim_inode_grab() and that function detects this (as I_RECLAIM is                                                                                                        
set). Also ip->i_ino will be set to 0 in that case so we must really avoid
using that to update the 'first_index' and 'done'. But it seems to Jan we
take the 'continue' branch only if xfs_reclaim_inode_grab() returned 1
before.
                                                                                                                                                                                
If Jan is correct the comment that inode may be reallocated seems to
be wrong and pretty misleading and the change I suggest above would not be
needed.

Thoughts?

  Luis

             reply index

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-25  8:25 Luis R. Rodriguez [this message]
2017-04-26  0:04 ` Dave Chinner
2017-04-26  7:34   ` Jan Kara
2017-04-26  9:12   ` Luis R. Rodriguez
2017-04-26 10:55     ` Jan Kara
2017-05-06 17:41       ` Luis R. Rodriguez
2017-05-06 17:52         ` Luis R. Rodriguez
2017-05-09 10:41           ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170425082503.GN28800@wotan.suse.de \
    --to=mcgrof@kernel.org \
    --cc=jack@suse.cz \
    --cc=linux-xfs@vger.kernel.org \
    --cc=mhocko@suse.com \
    --cc=nborisov@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-XFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-xfs/0 linux-xfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-xfs linux-xfs/ https://lore.kernel.org/linux-xfs \
		linux-xfs@vger.kernel.org
	public-inbox-index linux-xfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-xfs


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git