From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752944AbdLHJKs (ORCPT ); Fri, 8 Dec 2017 04:10:48 -0500 Received: from LGEAMRELO13.lge.com ([156.147.23.53]:33579 "EHLO lgeamrelo13.lge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751021AbdLHJKo (ORCPT ); Fri, 8 Dec 2017 04:10:44 -0500 X-Original-SENDERIP: 156.147.1.121 X-Original-MAILFROM: minchan@kernel.org X-Original-SENDERIP: 10.177.220.163 X-Original-MAILFROM: minchan@kernel.org Date: Fri, 8 Dec 2017 18:10:42 +0900 From: Minchan Kim To: "Huang, Ying" Cc: Andrew Morton , "Paul E. McKenney" , linux-mm@kvack.org, linux-kernel@vger.kernel.org, Hugh Dickins , Johannes Weiner , Tim Chen , Shaohua Li , Mel Gorman , =?utf-8?B?Su+/vXLvv71tZQ==?= Glisse , Michal Hocko , Andrea Arcangeli , David Rientjes , Rik van Riel , Jan Kara , Dave Jiang , Aaron Lu Subject: Re: [PATCH -mm] mm, swap: Fix race between swapoff and some swap operations Message-ID: <20171208091042.GA14472@bbox> References: <20171207011426.1633-1-ying.huang@intel.com> <20171207162937.6a179063a7c92ecac77e44af@linux-foundation.org> <20171208014346.GA8915@bbox> <87po7pg4jt.fsf@yhuang-dev.intel.com> <20171208082644.GA14361@bbox> <87k1xxbohp.fsf@yhuang-dev.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <87k1xxbohp.fsf@yhuang-dev.intel.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Dec 08, 2017 at 04:41:38PM +0800, Huang, Ying wrote: > Minchan Kim writes: > > > On Fri, Dec 08, 2017 at 01:41:10PM +0800, Huang, Ying wrote: > >> Minchan Kim writes: > >> > >> > On Thu, Dec 07, 2017 at 04:29:37PM -0800, Andrew Morton wrote: > >> >> On Thu, 7 Dec 2017 09:14:26 +0800 "Huang, Ying" wrote: > >> >> > >> >> > When the swapin is performed, after getting the swap entry information > >> >> > from the page table, the PTL (page table lock) will be released, then > >> >> > system will go to swap in the swap entry, without any lock held to > >> >> > prevent the swap device from being swapoff. This may cause the race > >> >> > like below, > >> >> > > >> >> > CPU 1 CPU 2 > >> >> > ----- ----- > >> >> > do_swap_page > >> >> > swapin_readahead > >> >> > __read_swap_cache_async > >> >> > swapoff swapcache_prepare > >> >> > p->swap_map = NULL __swap_duplicate > >> >> > p->swap_map[?] /* !!! NULL pointer access */ > >> >> > > >> >> > Because swap off is usually done when system shutdown only, the race > >> >> > may not hit many people in practice. But it is still a race need to > >> >> > be fixed. > >> >> > >> >> swapoff is so rare that it's hard to get motivated about any fix which > >> >> adds overhead to the regular codepaths. > >> > > >> > That was my concern, too when I see this patch. > >> > > >> >> > >> >> Is there something we can do to ensure that all the overhead of this > >> >> fix is placed into the swapoff side? stop_machine() may be a bit > >> >> brutal, but a surprising amount of code uses it. Any other ideas? > >> > > >> > How about this? > >> > > >> > I think It's same approach with old where we uses si->lock everywhere > >> > instead of more fine-grained cluster lock. > >> > > >> > The reason I repeated to reset p->max to zero in the loop is to avoid > >> > using lockdep annotation(maybe, spin_lock_nested(something) to prevent > >> > false positive. > >> > > >> > diff --git a/mm/swapfile.c b/mm/swapfile.c > >> > index 42fe5653814a..9ce007a42bbc 100644 > >> > --- a/mm/swapfile.c > >> > +++ b/mm/swapfile.c > >> > @@ -2644,6 +2644,19 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) > >> > swap_file = p->swap_file; > >> > old_block_size = p->old_block_size; > >> > p->swap_file = NULL; > >> > + > >> > + if (p->flags & SWP_SOLIDSTATE) { > >> > + unsigned long ci, nr_cluster; > >> > + > >> > + nr_cluster = DIV_ROUND_UP(p->max, SWAPFILE_CLUSTER); > >> > + for (ci = 0; ci < nr_cluster; ci++) { > >> > + struct swap_cluster_info *sci; > >> > + > >> > + sci = lock_cluster(p, ci * SWAPFILE_CLUSTER); > >> > + p->max = 0; > >> > + unlock_cluster(sci); > >> > + } > >> > + } > >> > p->max = 0; > >> > swap_map = p->swap_map; > >> > p->swap_map = NULL; > >> > @@ -3369,10 +3382,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage) > >> > goto bad_file; > >> > p = swap_info[type]; > >> > offset = swp_offset(entry); > >> > - if (unlikely(offset >= p->max)) > >> > - goto out; > >> > > >> > ci = lock_cluster_or_swap_info(p, offset); > >> > + if (unlikely(offset >= p->max)) > >> > + goto unlock_out; > >> > > >> > count = p->swap_map[offset]; > >> > > >> > >> Sorry, this doesn't work, because > >> > >> lock_cluster_or_swap_info() > >> > >> Need to read p->cluster_info, which may be freed during swapoff too. > >> > >> > >> To reduce the added overhead in regular code path, Maybe we can use SRCU > >> to implement get_swap_device() and put_swap_device()? There is only > >> increment/decrement on CPU local variable in srcu_read_lock/unlock(). > >> Should be acceptable in not so hot swap path? > >> > >> This needs to select CONFIG_SRCU if CONFIG_SWAP is enabled. But I guess > >> that should be acceptable too? > >> > > > > Why do we need srcu here? Is it enough with rcu like below? > > > > It might have a bug/room to be optimized about performance/naming. > > I just wanted to show my intention. > > Yes. rcu should work too. But if we use rcu, it may need to be called > several times to make sure the swap device under us doesn't go away, for > example, when checking si->max in __swp_swapcount() and I think it's not a big concern performance pov and benefit is good abstraction through current locking function so we don't need much churn. > add_swap_count_continuation(). And I found we need rcu to protect swap > cache radix tree array too. So I think it may be better to use one Could you elaborate it more about swap cache arrary problem? > calling to srcu_read_lock/unlock() instead of multiple callings to > rcu_read_lock/unlock(). > > Best Regards, > Huang, Ying > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: email@kvack.org