* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
@ 2021-04-08 20:46 kernel test robot
0 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2021-04-08 20:46 UTC (permalink / raw)
To: kbuild
[-- Attachment #1: Type: text/plain, Size: 13284 bytes --]
CC: kbuild-all(a)lists.01.org
In-Reply-To: <20210408130820.48233-3-linmiaohe@huawei.com>
References: <20210408130820.48233-3-linmiaohe@huawei.com>
TO: Miaohe Lin <linmiaohe@huawei.com>
TO: akpm(a)linux-foundation.org
CC: hannes(a)cmpxchg.org
CC: mhocko(a)suse.com
CC: iamjoonsoo.kim(a)lge.com
CC: vbabka(a)suse.cz
CC: alex.shi(a)linux.alibaba.com
CC: willy(a)infradead.org
CC: minchan(a)kernel.org
CC: richard.weiyang(a)gmail.com
CC: ying.huang(a)intel.com
Hi Miaohe,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on linux/master]
[also build test WARNING on linus/master hnaz-linux-mm/master v5.12-rc6 next-20210408]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patches, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Miaohe-Lin/close-various-race-windows-for-swap/20210408-211224
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 5e46d1b78a03d52306f21f77a4e4a144b6d31486
:::::: branch date: 8 hours ago
:::::: commit date: 8 hours ago
config: powerpc-randconfig-s032-20210408 (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 9.3.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# apt-get install sparse
# sparse version: v0.6.3-279-g6d5d9b42-dirty
# https://github.com/0day-ci/linux/commit/56e65e21c8c9858e36c3bca84006a15fe9b85efd
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Miaohe-Lin/close-various-race-windows-for-swap/20210408-211224
git checkout 56e65e21c8c9858e36c3bca84006a15fe9b85efd
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=powerpc
If you fix the issue, kindly add the following tag as appropriate:
Reported-by: kernel test robot <lkp@intel.com>
sparse warnings: (new ones prefixed by >>)
mm/swapfile.c:488:35: sparse: sparse: context imbalance in 'swap_do_scheduled_discard' - different lock contexts for basic block
mm/swapfile.c:664:9: sparse: sparse: context imbalance in 'scan_swap_map_try_ssd_cluster' - different lock contexts for basic block
mm/swapfile.c:954:20: sparse: sparse: context imbalance in 'scan_swap_map_slots' - unexpected unlock
mm/swapfile.c:1037:23: sparse: sparse: context imbalance in 'swap_free_cluster' - different lock contexts for basic block
mm/swapfile.c:1218:9: sparse: sparse: context imbalance in 'swap_info_get' - wrong count at exit
mm/swapfile.c:1230:36: sparse: sparse: context imbalance in 'swap_info_get_cont' - unexpected unlock
mm/swapfile.c:384:9: sparse: sparse: context imbalance in '__swap_entry_free' - different lock contexts for basic block
mm/swapfile.c:1361:23: sparse: sparse: context imbalance in 'swap_entry_free' - different lock contexts for basic block
mm/swapfile.c:1418:34: sparse: sparse: context imbalance in 'put_swap_page' - different lock contexts for basic block
mm/swapfile.c:1479:28: sparse: sparse: context imbalance in 'swapcache_free_entries' - unexpected unlock
mm/swapfile.c:384:9: sparse: sparse: context imbalance in 'page_swapcount' - different lock contexts for basic block
mm/swapfile.c:384:9: sparse: sparse: context imbalance in 'swap_swapcount' - different lock contexts for basic block
mm/swapfile.c:384:9: sparse: sparse: context imbalance in 'swp_swapcount' - different lock contexts for basic block
mm/swapfile.c:384:9: sparse: sparse: context imbalance in 'swap_page_trans_huge_swapped' - different lock contexts for basic block
mm/swapfile.c:1737:44: sparse: sparse: context imbalance in 'reuse_swap_page' - unexpected unlock
mm/swapfile.c:384:9: sparse: sparse: context imbalance in '__swap_duplicate' - different lock contexts for basic block
>> mm/swapfile.c:3673:23: sparse: sparse: context imbalance in 'add_swap_count_continuation' - different lock contexts for basic block
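For context on this class of warning: sparse reports "context imbalance ... different lock contexts for basic block" when the lock state at the end of a block depends on which path reached it, e.g. one path returns with a lock held while another has released it. A minimal userspace sketch of the pattern (illustrative only; a pthread mutex stands in for the kernel spinlock, and the function names are invented):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int resource;

/*
 * Imbalanced: the early-return path leaves 'lock' held while the
 * normal path releases it, so the function's exit has two different
 * lock contexts -- the shape of warning sparse emits above.
 */
int update_imbalanced(int v)
{
	pthread_mutex_lock(&lock);
	if (v < 0)
		return -1;	/* bug: returns with 'lock' still held */
	resource = v;
	pthread_mutex_unlock(&lock);
	return 0;
}

/* Balanced: every path releases the lock before returning. */
int update_balanced(int v)
{
	int ret = 0;

	pthread_mutex_lock(&lock);
	if (v < 0)
		ret = -1;
	else
		resource = v;
	pthread_mutex_unlock(&lock);
	return ret;
}
```

Note that only the warning prefixed '>>' is new to this series; the rest pre-date it.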
vim +/add_swap_count_continuation +3673 mm/swapfile.c
f981c5950fa859 Mel Gorman 2012-07-31 3563
570a335b8e2257 Hugh Dickins 2009-12-14 3564 /*
570a335b8e2257 Hugh Dickins 2009-12-14 3565 * add_swap_count_continuation - called when a swap count is duplicated
570a335b8e2257 Hugh Dickins 2009-12-14 3566 * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
570a335b8e2257 Hugh Dickins 2009-12-14 3567 * page of the original vmalloc'ed swap_map, to hold the continuation count
570a335b8e2257 Hugh Dickins 2009-12-14 3568 * (for that entry and for its neighbouring PAGE_SIZE swap entries). Called
570a335b8e2257 Hugh Dickins 2009-12-14 3569 * again when count is duplicated beyond SWAP_MAP_MAX * SWAP_CONT_MAX, etc.
570a335b8e2257 Hugh Dickins 2009-12-14 3570 *
570a335b8e2257 Hugh Dickins 2009-12-14 3571 * These continuation pages are seldom referenced: the common paths all work
570a335b8e2257 Hugh Dickins 2009-12-14 3572 * on the original swap_map, only referring to a continuation page when the
570a335b8e2257 Hugh Dickins 2009-12-14 3573 * low "digit" of a count is incremented or decremented through SWAP_MAP_MAX.
570a335b8e2257 Hugh Dickins 2009-12-14 3574 *
570a335b8e2257 Hugh Dickins 2009-12-14 3575 * add_swap_count_continuation(, GFP_ATOMIC) can be called while holding
570a335b8e2257 Hugh Dickins 2009-12-14 3576 * page table locks; if it fails, add_swap_count_continuation(, GFP_KERNEL)
570a335b8e2257 Hugh Dickins 2009-12-14 3577 * can be called after dropping locks.
570a335b8e2257 Hugh Dickins 2009-12-14 3578 */
570a335b8e2257 Hugh Dickins 2009-12-14 3579 int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
570a335b8e2257 Hugh Dickins 2009-12-14 3580 {
570a335b8e2257 Hugh Dickins 2009-12-14 3581 struct swap_info_struct *si;
235b62176712b9 Huang, Ying 2017-02-22 3582 struct swap_cluster_info *ci;
570a335b8e2257 Hugh Dickins 2009-12-14 3583 struct page *head;
570a335b8e2257 Hugh Dickins 2009-12-14 3584 struct page *page;
570a335b8e2257 Hugh Dickins 2009-12-14 3585 struct page *list_page;
570a335b8e2257 Hugh Dickins 2009-12-14 3586 pgoff_t offset;
570a335b8e2257 Hugh Dickins 2009-12-14 3587 unsigned char count;
eb085574a7526c Huang Ying 2019-07-11 3588 int ret = 0;
570a335b8e2257 Hugh Dickins 2009-12-14 3589
570a335b8e2257 Hugh Dickins 2009-12-14 3590 /*
570a335b8e2257 Hugh Dickins 2009-12-14 3591 * When debugging, it's easier to use __GFP_ZERO here; but it's better
570a335b8e2257 Hugh Dickins 2009-12-14 3592 * for latency not to zero a page while GFP_ATOMIC and holding locks.
570a335b8e2257 Hugh Dickins 2009-12-14 3593 */
570a335b8e2257 Hugh Dickins 2009-12-14 3594 page = alloc_page(gfp_mask | __GFP_HIGHMEM);
570a335b8e2257 Hugh Dickins 2009-12-14 3595
eb085574a7526c Huang Ying 2019-07-11 3596 si = get_swap_device(entry);
570a335b8e2257 Hugh Dickins 2009-12-14 3597 if (!si) {
570a335b8e2257 Hugh Dickins 2009-12-14 3598 /*
570a335b8e2257 Hugh Dickins 2009-12-14 3599 * An acceptable race has occurred since the failing
eb085574a7526c Huang Ying 2019-07-11 3600 * __swap_duplicate(): the swap device may be swapoff
570a335b8e2257 Hugh Dickins 2009-12-14 3601 */
570a335b8e2257 Hugh Dickins 2009-12-14 3602 goto outer;
570a335b8e2257 Hugh Dickins 2009-12-14 3603 }
eb085574a7526c Huang Ying 2019-07-11 3604 spin_lock(&si->lock);
570a335b8e2257 Hugh Dickins 2009-12-14 3605
570a335b8e2257 Hugh Dickins 2009-12-14 3606 offset = swp_offset(entry);
235b62176712b9 Huang, Ying 2017-02-22 3607
235b62176712b9 Huang, Ying 2017-02-22 3608 ci = lock_cluster(si, offset);
235b62176712b9 Huang, Ying 2017-02-22 3609
d8aa24e04fb2a7 Miaohe Lin 2020-12-14 3610 count = swap_count(si->swap_map[offset]);
570a335b8e2257 Hugh Dickins 2009-12-14 3611
570a335b8e2257 Hugh Dickins 2009-12-14 3612 if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) {
570a335b8e2257 Hugh Dickins 2009-12-14 3613 /*
570a335b8e2257 Hugh Dickins 2009-12-14 3614 * The higher the swap count, the more likely it is that tasks
570a335b8e2257 Hugh Dickins 2009-12-14 3615 * will race to add swap count continuation: we need to avoid
570a335b8e2257 Hugh Dickins 2009-12-14 3616 * over-provisioning.
570a335b8e2257 Hugh Dickins 2009-12-14 3617 */
570a335b8e2257 Hugh Dickins 2009-12-14 3618 goto out;
570a335b8e2257 Hugh Dickins 2009-12-14 3619 }
570a335b8e2257 Hugh Dickins 2009-12-14 3620
570a335b8e2257 Hugh Dickins 2009-12-14 3621 if (!page) {
eb085574a7526c Huang Ying 2019-07-11 3622 ret = -ENOMEM;
eb085574a7526c Huang Ying 2019-07-11 3623 goto out;
570a335b8e2257 Hugh Dickins 2009-12-14 3624 }
570a335b8e2257 Hugh Dickins 2009-12-14 3625
570a335b8e2257 Hugh Dickins 2009-12-14 3626 /*
570a335b8e2257 Hugh Dickins 2009-12-14 3627 * We are fortunate that although vmalloc_to_page uses pte_offset_map,
570a335b8e2257 Hugh Dickins 2009-12-14 3628 * no architecture is using highmem pages for kernel page tables: so it
570a335b8e2257 Hugh Dickins 2009-12-14 3629 * will not corrupt the GFP_ATOMIC caller's atomic page table kmaps.
570a335b8e2257 Hugh Dickins 2009-12-14 3630 */
570a335b8e2257 Hugh Dickins 2009-12-14 3631 head = vmalloc_to_page(si->swap_map + offset);
570a335b8e2257 Hugh Dickins 2009-12-14 3632 offset &= ~PAGE_MASK;
570a335b8e2257 Hugh Dickins 2009-12-14 3633
2628bd6fc052bd Huang Ying 2017-11-02 3634 spin_lock(&si->cont_lock);
570a335b8e2257 Hugh Dickins 2009-12-14 3635 /*
570a335b8e2257 Hugh Dickins 2009-12-14 3636 * Page allocation does not initialize the page's lru field,
570a335b8e2257 Hugh Dickins 2009-12-14 3637 * but it does always reset its private field.
570a335b8e2257 Hugh Dickins 2009-12-14 3638 */
570a335b8e2257 Hugh Dickins 2009-12-14 3639 if (!page_private(head)) {
570a335b8e2257 Hugh Dickins 2009-12-14 3640 BUG_ON(count & COUNT_CONTINUED);
570a335b8e2257 Hugh Dickins 2009-12-14 3641 INIT_LIST_HEAD(&head->lru);
570a335b8e2257 Hugh Dickins 2009-12-14 3642 set_page_private(head, SWP_CONTINUED);
570a335b8e2257 Hugh Dickins 2009-12-14 3643 si->flags |= SWP_CONTINUED;
570a335b8e2257 Hugh Dickins 2009-12-14 3644 }
570a335b8e2257 Hugh Dickins 2009-12-14 3645
570a335b8e2257 Hugh Dickins 2009-12-14 3646 list_for_each_entry(list_page, &head->lru, lru) {
570a335b8e2257 Hugh Dickins 2009-12-14 3647 unsigned char *map;
570a335b8e2257 Hugh Dickins 2009-12-14 3648
570a335b8e2257 Hugh Dickins 2009-12-14 3649 /*
570a335b8e2257 Hugh Dickins 2009-12-14 3650 * If the previous map said no continuation, but we've found
570a335b8e2257 Hugh Dickins 2009-12-14 3651 * a continuation page, free our allocation and use this one.
570a335b8e2257 Hugh Dickins 2009-12-14 3652 */
570a335b8e2257 Hugh Dickins 2009-12-14 3653 if (!(count & COUNT_CONTINUED))
2628bd6fc052bd Huang Ying 2017-11-02 3654 goto out_unlock_cont;
570a335b8e2257 Hugh Dickins 2009-12-14 3655
9b04c5fec43c0d Cong Wang 2011-11-25 3656 map = kmap_atomic(list_page) + offset;
570a335b8e2257 Hugh Dickins 2009-12-14 3657 count = *map;
9b04c5fec43c0d Cong Wang 2011-11-25 3658 kunmap_atomic(map);
570a335b8e2257 Hugh Dickins 2009-12-14 3659
570a335b8e2257 Hugh Dickins 2009-12-14 3660 /*
570a335b8e2257 Hugh Dickins 2009-12-14 3661 * If this continuation count now has some space in it,
570a335b8e2257 Hugh Dickins 2009-12-14 3662 * free our allocation and use this one.
570a335b8e2257 Hugh Dickins 2009-12-14 3663 */
570a335b8e2257 Hugh Dickins 2009-12-14 3664 if ((count & ~COUNT_CONTINUED) != SWAP_CONT_MAX)
2628bd6fc052bd Huang Ying 2017-11-02 3665 goto out_unlock_cont;
570a335b8e2257 Hugh Dickins 2009-12-14 3666 }
570a335b8e2257 Hugh Dickins 2009-12-14 3667
570a335b8e2257 Hugh Dickins 2009-12-14 3668 list_add_tail(&page->lru, &head->lru);
570a335b8e2257 Hugh Dickins 2009-12-14 3669 page = NULL; /* now it's attached, don't free it */
2628bd6fc052bd Huang Ying 2017-11-02 3670 out_unlock_cont:
2628bd6fc052bd Huang Ying 2017-11-02 3671 spin_unlock(&si->cont_lock);
570a335b8e2257 Hugh Dickins 2009-12-14 3672 out:
235b62176712b9 Huang, Ying 2017-02-22 @3673 unlock_cluster(ci);
ec8acf20afb853 Shaohua Li 2013-02-22 3674 spin_unlock(&si->lock);
eb085574a7526c Huang Ying 2019-07-11 3675 put_swap_device(si);
570a335b8e2257 Hugh Dickins 2009-12-14 3676 outer:
570a335b8e2257 Hugh Dickins 2009-12-14 3677 if (page)
570a335b8e2257 Hugh Dickins 2009-12-14 3678 __free_page(page);
eb085574a7526c Huang Ying 2019-07-11 3679 return ret;
570a335b8e2257 Hugh Dickins 2009-12-14 3680 }
570a335b8e2257 Hugh Dickins 2009-12-14 3681
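As the comment block above explains, counts beyond SWAP_MAP_MAX spill into continuation pages, each holding one more "digit" of the count for the same offsets. Stripped of locking and page management, that is a mixed-radix counter. A toy userspace model (invented constants and helper names, not the kernel's actual encoding) of how an increment carries through continuation digits:

```c
#include <assert.h>

#define CONT_MAX 4	/* toy digit capacity; the kernel's SWAP_CONT_MAX differs */

/*
 * digit[0] plays the role of the swap_map byte; digit[1..] stand in
 * for continuation pages.  Incrementing carries into the next digit
 * when the current one saturates.
 */
static void count_inc(unsigned char *digit, int ndigits)
{
	for (int i = 0; i < ndigits; i++) {
		if (digit[i] < CONT_MAX) {
			digit[i]++;
			return;
		}
		digit[i] = 0;	/* carry: wrap this digit, bump the next */
	}
}

static void count_inc_n(unsigned char *digit, int ndigits, int n)
{
	while (n--)
		count_inc(digit, ndigits);
}

/* Recover the total count from the digits (base CONT_MAX + 1). */
static unsigned long count_value(const unsigned char *digit, int ndigits)
{
	unsigned long total = 0;

	for (int i = ndigits - 1; i >= 0; i--)
		total = total * (CONT_MAX + 1) + digit[i];
	return total;
}
```

The continuation pages are touched only on the carry path, which is why the comment above says they are seldom referenced.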
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org
[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 30996 bytes --]
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-14 16:13 ` Tim Chen
@ 2021-04-15 3:19 ` Miaohe Lin
0 siblings, 0 replies; 24+ messages in thread
From: Miaohe Lin @ 2021-04-15 3:19 UTC (permalink / raw)
To: Tim Chen, Huang, Ying
Cc: akpm, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, willy,
minchan, richard.weiyang, hughd, linux-kernel, linux-mm
On 2021/4/15 0:13, Tim Chen wrote:
>
>
> On 4/13/21 6:04 PM, Huang, Ying wrote:
>> Tim Chen <tim.c.chen@linux.intel.com> writes:
>>
>>> On 4/12/21 6:27 PM, Huang, Ying wrote:
>>>
>>>>
>>>> This isn't the commit that introduces the race. You can use `git blame`
>>>> to find out the correct commit. For this it's commit 0bcac06f27d7 "mm,
>>>> swap: skip swapcache for swapin of synchronous device".
>>>>
>>>> And I suggest merging 1/5 and 2/5 to make it easy to get the full
>>>> picture.
>>>
>>> I'd suggest making the fix for the do_swap_page race with get/put_swap_device
>>> the first patch. Then the percpu_ref stuff in patch 1 and patch 2 can
>>> be combined together.
>>
>> The original get/put_swap_device() uses rcu_read_lock/unlock(). I don't
>> think it's good to wrap swap_readpage() with it. After all, some
>> complex operations are done in swap_readpage(), including
>> blk_io_schedule().
>>
>
> In that case, have the patches that make get/put_swap_device use
> percpu_ref come first, and the patch that fixes the race in
> do_swap_page come later as another patch.
>
> Patch 2 is mixing the two.
>
Looks like a good way to organize this patch series. Many thanks!
> Tim
> .
>
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-14 1:04 ` Huang, Ying
@ 2021-04-14 16:13 ` Tim Chen
2021-04-15 3:19 ` Miaohe Lin
-1 siblings, 1 reply; 24+ messages in thread
From: Tim Chen @ 2021-04-14 16:13 UTC (permalink / raw)
To: Huang, Ying
Cc: Miaohe Lin, akpm, hannes, mhocko, iamjoonsoo.kim, vbabka,
alex.shi, willy, minchan, richard.weiyang, hughd, linux-kernel,
linux-mm
On 4/13/21 6:04 PM, Huang, Ying wrote:
> Tim Chen <tim.c.chen@linux.intel.com> writes:
>
>> On 4/12/21 6:27 PM, Huang, Ying wrote:
>>
>>>
>>> This isn't the commit that introduces the race. You can use `git blame`
>>> to find out the correct commit. For this it's commit 0bcac06f27d7 "mm,
>>> swap: skip swapcache for swapin of synchronous device".
>>>
>>> And I suggest merging 1/5 and 2/5 to make it easy to get the full
>>> picture.
>>
>> I'd suggest making the fix for the do_swap_page race with get/put_swap_device
>> the first patch. Then the percpu_ref stuff in patch 1 and patch 2 can
>> be combined together.
>
> The original get/put_swap_device() uses rcu_read_lock/unlock(). I don't
> think it's good to wrap swap_readpage() with it. After all, some
> complex operations are done in swap_readpage(), including
> blk_io_schedule().
>
In that case, have the patches that make get/put_swap_device use
percpu_ref come first, and the patch that fixes the race in
do_swap_page come later as another patch.
Patch 2 is mixing the two.
Tim
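The trade-off Tim and Ying are weighing here: an RCU read-side critical section must not sleep, so it cannot cover swap_readpage() (which may call blk_io_schedule()), whereas a reference count may be held across sleeping code. A much-simplified userspace analogue of the percpu_ref idea (a single shared atomic instead of per-CPU counters, and without percpu_ref's kill/confirm handshake, so the live re-check below is only a sketch, not a faithful model):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for a swap_info_struct guarded by a percpu_ref. */
struct swap_dev {
	atomic_int users;	/* the kernel uses per-CPU counters here */
	atomic_bool live;	/* cleared by swapoff before teardown */
};

/* Analogue of percpu_ref_tryget_live(): only succeed while live. */
static bool dev_tryget_live(struct swap_dev *d)
{
	if (!atomic_load(&d->live))
		return false;
	atomic_fetch_add(&d->users, 1);
	/* naive re-check; the real API closes this window differently */
	if (!atomic_load(&d->live)) {
		atomic_fetch_sub(&d->users, 1);
		return false;
	}
	return true;
}

/* Analogue of percpu_ref_put(). */
static void dev_put(struct swap_dev *d)
{
	atomic_fetch_sub(&d->users, 1);
}
```

The point is that between a successful dev_tryget_live() and the matching dev_put() the caller may sleep, with teardown waiting for the count to drain, which the RCU scheme cannot allow without forbidding sleep in the guarded region.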
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-14 3:07 ` Huang, Ying
@ 2021-04-14 3:27 ` Miaohe Lin
-1 siblings, 0 replies; 24+ messages in thread
From: Miaohe Lin @ 2021-04-14 3:27 UTC (permalink / raw)
To: Huang, Ying
Cc: akpm, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, willy,
minchan, richard.weiyang, hughd, tim.c.chen, linux-kernel,
linux-mm
On 2021/4/14 11:07, Huang, Ying wrote:
> Miaohe Lin <linmiaohe@huawei.com> writes:
>
>> On 2021/4/13 9:27, Huang, Ying wrote:
>>> Miaohe Lin <linmiaohe@huawei.com> writes:
>>>
>>>> When I was investigating the swap code, I found the below possible race
>>>> window:
>>>>
>>>> CPU 1                                CPU 2
>>>> -----                                -----
>>>> do_swap_page
>>>>   synchronous swap_readpage
>>>>     alloc_page_vma
>>>>                                      swapoff
>>>>                                        release swap_file, bdev, or ...
>>>>   swap_readpage
>>>>     check sis->flags is ok
>>>>       access swap_file, bdev...[oops!]
>>>>                                      si->flags = 0
>>>>
>>>> Using the current get/put_swap_device() to guard against concurrent swapoff
>>>> for swap_readpage() looks terrible because swap_readpage() may take a really
>>>> long time. And this race may not be really pernicious because swapoff is
>>>> usually only done at system shutdown. To reduce the performance overhead on
>>>> the hot path as much as possible, it appears we can use percpu_ref to close
>>>> this race window (as suggested by Huang, Ying).
>>>>
>>>> Fixes: 235b62176712 ("mm/swap: add cluster lock")
>>>
>>> This isn't the commit that introduces the race. You can use `git blame`
>>> to find out the correct commit. For this it's commit 0bcac06f27d7 "mm,
>>> swap: skip swapcache for swapin of synchronous device".
>>>
>>
>> Sorry about it! What I referred to is commit eb085574a752 ("mm, swap: fix race between
>> swapoff and some swap operations"). And I think that commit does not fix the race
>> condition completely, so I reused the Fixes tag from it.
>>
>>> And I suggest merging 1/5 and 2/5 to make it easy to get the full
>>> picture.
>>>
>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>> ---
>>>> include/linux/swap.h | 2 +-
>>>> mm/memory.c | 10 ++++++++++
>>>> mm/swapfile.c | 28 +++++++++++-----------------
>>>> 3 files changed, 22 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>> index 849ba5265c11..9066addb57fd 100644
>>>> --- a/include/linux/swap.h
>>>> +++ b/include/linux/swap.h
>>>> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>>>>
>>>> static inline void put_swap_device(struct swap_info_struct *si)
>>>> {
>>>> - rcu_read_unlock();
>>>> + percpu_ref_put(&si->users);
>>>> }
>>>>
>>>> #else /* CONFIG_SWAP */
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index cc71a445c76c..8543c47b955c 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>> {
>>>> struct vm_area_struct *vma = vmf->vma;
>>>> struct page *page = NULL, *swapcache;
>>>> + struct swap_info_struct *si = NULL;
>>>> swp_entry_t entry;
>>>> pte_t pte;
>>>> int locked;
>>>> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>> }
>>>>
>>>>
>>>
>>> I suggest adding a comment here as follows (words copied from Matthew Wilcox):
>>>
>>> /* Prevent swapoff from happening to us */
>>
>> Ok.
>>
>>>
>>>> + si = get_swap_device(entry);
>>>> + /* In case we raced with swapoff. */
>>>> + if (unlikely(!si))
>>>> + goto out;
>>>> +
>>>
>>> Because we wrap the whole do_swap_page() with get/put_swap_device()
>>> now, we can remove several get/put_swap_device() calls from functions
>>> called by do_swap_page(). That can be another optimization patch.
>>
>> I tried to remove several get/put_swap_device() calls from functions called
>> by do_swap_page() before I sent this series. But it seems they have
>> other callers without proper get/put_swap_device().
>
> Then we need to revise these callers instead. Anyway, that can be another
> series.
Yes, it can be another series.
Thanks.
>
> Best Regards,
> Huang, Ying
>
> .
>
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-14 2:55 ` Miaohe Lin
@ 2021-04-14 3:07 ` Huang, Ying
0 siblings, 0 replies; 24+ messages in thread
From: Huang, Ying @ 2021-04-14 3:07 UTC (permalink / raw)
To: Miaohe Lin
Cc: akpm, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, willy,
minchan, richard.weiyang, hughd, tim.c.chen, linux-kernel,
linux-mm
Miaohe Lin <linmiaohe@huawei.com> writes:
> On 2021/4/13 9:27, Huang, Ying wrote:
>> Miaohe Lin <linmiaohe@huawei.com> writes:
>>
>>> When I was investigating the swap code, I found the below possible race
>>> window:
>>>
>>> CPU 1                                CPU 2
>>> -----                                -----
>>> do_swap_page
>>>   synchronous swap_readpage
>>>     alloc_page_vma
>>>                                      swapoff
>>>                                        release swap_file, bdev, or ...
>>>   swap_readpage
>>>     check sis->flags is ok
>>>       access swap_file, bdev...[oops!]
>>>                                      si->flags = 0
>>>
>>> Using the current get/put_swap_device() to guard against concurrent swapoff
>>> for swap_readpage() looks terrible because swap_readpage() may take a really
>>> long time. And this race may not be really pernicious because swapoff is
>>> usually only done at system shutdown. To reduce the performance overhead on
>>> the hot path as much as possible, it appears we can use percpu_ref to close
>>> this race window (as suggested by Huang, Ying).
>>>
>>> Fixes: 235b62176712 ("mm/swap: add cluster lock")
>>
>> This isn't the commit that introduces the race. You can use `git blame`
>> to find out the correct commit. For this it's commit 0bcac06f27d7 "mm,
>> swap: skip swapcache for swapin of synchronous device".
>>
>
> Sorry about it! What I referred to is commit eb085574a752 ("mm, swap: fix race between
> swapoff and some swap operations"). And I think that commit does not fix the race
> condition completely, so I reused the Fixes tag from it.
>
>> And I suggest merging 1/5 and 2/5 to make it easy to get the full
>> picture.
>>
>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>> ---
>>> include/linux/swap.h | 2 +-
>>> mm/memory.c | 10 ++++++++++
>>> mm/swapfile.c | 28 +++++++++++-----------------
>>> 3 files changed, 22 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 849ba5265c11..9066addb57fd 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>>>
>>> static inline void put_swap_device(struct swap_info_struct *si)
>>> {
>>> - rcu_read_unlock();
>>> + percpu_ref_put(&si->users);
>>> }
>>>
>>> #else /* CONFIG_SWAP */
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index cc71a445c76c..8543c47b955c 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> {
>>> struct vm_area_struct *vma = vmf->vma;
>>> struct page *page = NULL, *swapcache;
>>> + struct swap_info_struct *si = NULL;
>>> swp_entry_t entry;
>>> pte_t pte;
>>> int locked;
>>> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> }
>>>
>>>
>>
>> I suggest adding a comment here as follows (words copied from Matthew Wilcox):
>>
>> /* Prevent swapoff from happening to us */
>
> Ok.
>
>>
>>> + si = get_swap_device(entry);
>>> + /* In case we raced with swapoff. */
>>> + if (unlikely(!si))
>>> + goto out;
>>> +
>>
>> Because we wrap the whole do_swap_page() with get/put_swap_device()
>> now, we can remove several get/put_swap_device() calls from functions
>> called by do_swap_page(). That can be another optimization patch.
>
> I tried to remove several get/put_swap_device() calls from functions called
> by do_swap_page() before I sent this series. But it seems they have
> other callers without proper get/put_swap_device().
Then we need to revise these callers instead. Anyway, that can be another
series.
Best Regards,
Huang, Ying
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-13 1:27 ` Huang, Ying
@ 2021-04-14 2:55 ` Miaohe Lin
2021-04-14 3:07 ` Huang, Ying
-1 siblings, 1 reply; 24+ messages in thread
From: Miaohe Lin @ 2021-04-14 2:55 UTC (permalink / raw)
To: Huang, Ying
Cc: akpm, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, willy,
minchan, richard.weiyang, hughd, tim.c.chen, linux-kernel,
linux-mm
On 2021/4/13 9:27, Huang, Ying wrote:
> Miaohe Lin <linmiaohe@huawei.com> writes:
>
>> When I was investigating the swap code, I found the below possible race
>> window:
>>
>> CPU 1                                CPU 2
>> -----                                -----
>> do_swap_page
>>   synchronous swap_readpage
>>     alloc_page_vma
>>                                      swapoff
>>                                        release swap_file, bdev, or ...
>>   swap_readpage
>>     check sis->flags is ok
>>       access swap_file, bdev...[oops!]
>>                                      si->flags = 0
>>
>> Using the current get/put_swap_device() to guard against concurrent swapoff
>> for swap_readpage() looks terrible because swap_readpage() may take a really
>> long time. And this race may not be really pernicious because swapoff is
>> usually only done at system shutdown. To reduce the performance overhead on
>> the hot path as much as possible, it appears we can use percpu_ref to close
>> this race window (as suggested by Huang, Ying).
>>
>> Fixes: 235b62176712 ("mm/swap: add cluster lock")
>
> This isn't the commit that introduces the race. You can use `git blame`
> to find out the correct commit. For this it's commit 0bcac06f27d7 "mm,
> swap: skip swapcache for swapin of synchronous device".
>
Sorry about it! What I referred to is commit eb085574a752 ("mm, swap: fix race between
swapoff and some swap operations"). And I think that commit does not fix the race
condition completely, so I reused the Fixes tag from it.
> And I suggest merging 1/5 and 2/5 to make it easy to get the full
> picture.
>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> ---
>> include/linux/swap.h | 2 +-
>> mm/memory.c | 10 ++++++++++
>> mm/swapfile.c | 28 +++++++++++-----------------
>> 3 files changed, 22 insertions(+), 18 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 849ba5265c11..9066addb57fd 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>>
>> static inline void put_swap_device(struct swap_info_struct *si)
>> {
>> - rcu_read_unlock();
>> + percpu_ref_put(&si->users);
>> }
>>
>> #else /* CONFIG_SWAP */
>> diff --git a/mm/memory.c b/mm/memory.c
>> index cc71a445c76c..8543c47b955c 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> {
>> struct vm_area_struct *vma = vmf->vma;
>> struct page *page = NULL, *swapcache;
>> + struct swap_info_struct *si = NULL;
>> swp_entry_t entry;
>> pte_t pte;
>> int locked;
>> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> }
>>
>>
>
> I suggest adding a comment here as follows (words copied from Matthew Wilcox):
>
> /* Prevent swapoff from happening to us */
Ok.
>
>> + si = get_swap_device(entry);
>> + /* In case we raced with swapoff. */
>> + if (unlikely(!si))
>> + goto out;
>> +
>
> Because we wrap the whole do_swap_page() with get/put_swap_device()
> now, we can remove several get/put_swap_device() calls from functions
> called by do_swap_page(). That can be another optimization patch.
I tried to remove several get/put_swap_device() calls from functions called
by do_swap_page() before I sent this series. But it seems they have
other callers without proper get/put_swap_device().
>
>> delayacct_set_flag(DELAYACCT_PF_SWAPIN);
>> page = lookup_swap_cache(entry, vma, vmf->address);
>> swapcache = page;
>> @@ -3514,6 +3520,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> unlock:
>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>> out:
>> + if (si)
>> + put_swap_device(si);
>> return ret;
>> out_nomap:
>> pte_unmap_unlock(vmf->pte, vmf->ptl);
>> @@ -3525,6 +3533,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>> unlock_page(swapcache);
>> put_page(swapcache);
>> }
>> + if (si)
>> + put_swap_device(si);
>> return ret;
>> }
>>
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 724173cd7d0c..01032c72ceae 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -1280,18 +1280,12 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
>> * via preventing the swap device from being swapoff, until
>> * put_swap_device() is called. Otherwise return NULL.
>> *
>> - * The entirety of the RCU read critical section must come before the
>> - * return from or after the call to synchronize_rcu() in
>> - * enable_swap_info() or swapoff(). So if "si->flags & SWP_VALID" is
>> - * true, the si->map, si->cluster_info, etc. must be valid in the
>> - * critical section.
>> - *
>> * Notice that swapoff or swapoff+swapon can still happen before the
>> - * rcu_read_lock() in get_swap_device() or after the rcu_read_unlock()
>> - * in put_swap_device() if there isn't any other way to prevent
>> - * swapoff, such as page lock, page table lock, etc. The caller must
>> - * be prepared for that. For example, the following situation is
>> - * possible.
>> + * percpu_ref_tryget_live() in get_swap_device() or after the
>> + * percpu_ref_put() in put_swap_device() if there isn't any other way
>> + * to prevent swapoff, such as page lock, page table lock, etc. The
>> + * caller must be prepared for that. For example, the following
>> + * situation is possible.
>> *
>> * CPU1 CPU2
>> * do_swap_page()
>> @@ -1319,21 +1313,21 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>> si = swp_swap_info(entry);
>> if (!si)
>> goto bad_nofile;
>> -
>> - rcu_read_lock();
>> if (data_race(!(si->flags & SWP_VALID)))
>
> We can delete SWP_VALID; it is only used together with the RCU solution.
Will do.
>
>> - goto unlock_out;
>> + goto out;
>> + if (!percpu_ref_tryget_live(&si->users))
>> + goto out;
>> offset = swp_offset(entry);
>> if (offset >= si->max)
>> - goto unlock_out;
>> + goto put_out;
>>
>> return si;
>> bad_nofile:
>> pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
>> out:
>> return NULL;
>> -unlock_out:
>> - rcu_read_unlock();
>> +put_out:
>> + percpu_ref_put(&si->users);
>> return NULL;
>> }
>
Many thanks.
> Best Regards,
> Huang, Ying
> .
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-14 1:04 ` Huang, Ying
(?)
@ 2021-04-14 2:20 ` Miaohe Lin
-1 siblings, 0 replies; 24+ messages in thread
From: Miaohe Lin @ 2021-04-14 2:20 UTC (permalink / raw)
To: Huang, Ying, Tim Chen
Cc: akpm, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, willy,
minchan, richard.weiyang, hughd, linux-kernel, linux-mm
On 2021/4/14 9:04, Huang, Ying wrote:
> Tim Chen <tim.c.chen@linux.intel.com> writes:
>
>> On 4/12/21 6:27 PM, Huang, Ying wrote:
>>
>>>
>>> This isn't the commit that introduces the race. You can use `git blame` to
>>> find out the correct commit. For this it's commit 0bcac06f27d7 "mm,
>>> swap: skip swapcache for swapin of synchronous device".
>>>
>>> And I suggest merging 1/5 and 2/5 to make it easier to get the full
>>> picture.
>>
>> I'd suggest making the fix for the do_swap_page() race with get/put_swap_device()
>> the first patch. Then the percpu_ref work in patch 1 and patch 2 can
>> be combined.
>
> The original get/put_swap_device() uses rcu_read_lock/unlock(). I don't
> think it's good to wrap swap_readpage() with it. After all, some
> complex operations are done in swap_readpage(), including
> blk_io_schedule().
>
The patch was originally split to make it easier to review: 1/5 introduces
the percpu_ref to swap, and 2/5 uses it to fix the race between do_swap_page()
and swapoff.
Btw, I have no preference on whether 1/5 and 2/5 are merged.
> Best Regards,
> Huang, Ying
> .
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-13 19:24 ` Tim Chen
@ 2021-04-14 1:04 ` Huang, Ying
0 siblings, 0 replies; 24+ messages in thread
From: Huang, Ying @ 2021-04-14 1:04 UTC (permalink / raw)
To: Tim Chen
Cc: Miaohe Lin, akpm, hannes, mhocko, iamjoonsoo.kim, vbabka,
alex.shi, willy, minchan, richard.weiyang, hughd, linux-kernel,
linux-mm
Tim Chen <tim.c.chen@linux.intel.com> writes:
> On 4/12/21 6:27 PM, Huang, Ying wrote:
>
>>
>> This isn't the commit that introduces the race. You can use `git blame` to
>> find out the correct commit. For this it's commit 0bcac06f27d7 "mm,
>> swap: skip swapcache for swapin of synchronous device".
>>
>> And I suggest merging 1/5 and 2/5 to make it easier to get the full
>> picture.
>
> I'd suggest making the fix for the do_swap_page() race with get/put_swap_device()
> the first patch. Then the percpu_ref work in patch 1 and patch 2 can
> be combined.
The original get/put_swap_device() uses rcu_read_lock/unlock(). I don't
think it's good to wrap swap_readpage() with it. After all, some
complex operations are done in swap_readpage(), including
blk_io_schedule().
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-13 1:27 ` Huang, Ying
@ 2021-04-13 19:24 ` Tim Chen
2021-04-14 1:04 ` Huang, Ying
-1 siblings, 1 reply; 24+ messages in thread
From: Tim Chen @ 2021-04-13 19:24 UTC (permalink / raw)
To: Huang, Ying, Miaohe Lin
Cc: akpm, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, willy,
minchan, richard.weiyang, hughd, linux-kernel, linux-mm
On 4/12/21 6:27 PM, Huang, Ying wrote:
>
> This isn't the commit that introduces the race. You can use `git blame` to
> find out the correct commit. For this it's commit 0bcac06f27d7 "mm,
> swap: skip swapcache for swapin of synchronous device".
>
> And I suggest merging 1/5 and 2/5 to make it easier to get the full
> picture.
I'd suggest making the fix for the do_swap_page() race with get/put_swap_device()
the first patch. Then the percpu_ref work in patch 1 and patch 2 can
be combined.
Tim
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-08 13:08 ` [PATCH 2/5] swap: fix do_swap_page() race with swapoff Miaohe Lin
@ 2021-04-13 1:27 ` Huang, Ying
2021-04-08 21:37 ` kernel test robot
` (2 subsequent siblings)
3 siblings, 0 replies; 24+ messages in thread
From: Huang, Ying @ 2021-04-13 1:27 UTC (permalink / raw)
To: Miaohe Lin
Cc: akpm, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, willy,
minchan, richard.weiyang, hughd, tim.c.chen, linux-kernel,
linux-mm, Matthew Wilcox
Miaohe Lin <linmiaohe@huawei.com> writes:
> When I was investigating the swap code, I found the below possible race
> window:
>
> CPU 1 CPU 2
> ----- -----
> do_swap_page
> synchronous swap_readpage
> alloc_page_vma
> swapoff
> release swap_file, bdev, or ...
> swap_readpage
> check sis->flags is ok
> access swap_file, bdev...[oops!]
> si->flags = 0
>
> Using the current get/put_swap_device() to guard against concurrent swapoff for
> swap_readpage() looks terrible because swap_readpage() may take a really long
> time. And this race may not be really pernicious because swapoff is usually
> done only at system shutdown. To reduce the performance overhead on the
> hot path as much as possible, it appears we can use a percpu_ref to close
> this race window (as suggested by Huang, Ying).
>
> Fixes: 235b62176712 ("mm/swap: add cluster lock")
This isn't the commit that introduces the race. You can use `git blame` to
find out the correct commit. For this it's commit 0bcac06f27d7 "mm,
swap: skip swapcache for swapin of synchronous device".
And I suggest merging 1/5 and 2/5 to make it easier to get the full
picture.
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
> include/linux/swap.h | 2 +-
> mm/memory.c | 10 ++++++++++
> mm/swapfile.c | 28 +++++++++++-----------------
> 3 files changed, 22 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 849ba5265c11..9066addb57fd 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
>
> static inline void put_swap_device(struct swap_info_struct *si)
> {
> - rcu_read_unlock();
> + percpu_ref_put(&si->users);
> }
>
> #else /* CONFIG_SWAP */
> diff --git a/mm/memory.c b/mm/memory.c
> index cc71a445c76c..8543c47b955c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> struct page *page = NULL, *swapcache;
> + struct swap_info_struct *si = NULL;
> swp_entry_t entry;
> pte_t pte;
> int locked;
> @@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> }
>
>
I suggest adding a comment here as follows (wording copied from Matthew Wilcox):
/* Prevent swapoff from happening to us */
> + si = get_swap_device(entry);
> + /* In case we raced with swapoff. */
> + if (unlikely(!si))
> + goto out;
> +
Because we now wrap the whole of do_swap_page() with get/put_swap_device(),
we can remove several get/put_swap_device() calls from the functions called by
do_swap_page(). That can be another optimization patch.
> delayacct_set_flag(DELAYACCT_PF_SWAPIN);
> page = lookup_swap_cache(entry, vma, vmf->address);
> swapcache = page;
> @@ -3514,6 +3520,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> unlock:
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> out:
> + if (si)
> + put_swap_device(si);
> return ret;
> out_nomap:
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -3525,6 +3533,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> unlock_page(swapcache);
> put_page(swapcache);
> }
> + if (si)
> + put_swap_device(si);
> return ret;
> }
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 724173cd7d0c..01032c72ceae 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1280,18 +1280,12 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
> * via preventing the swap device from being swapoff, until
> * put_swap_device() is called. Otherwise return NULL.
> *
> - * The entirety of the RCU read critical section must come before the
> - * return from or after the call to synchronize_rcu() in
> - * enable_swap_info() or swapoff(). So if "si->flags & SWP_VALID" is
> - * true, the si->map, si->cluster_info, etc. must be valid in the
> - * critical section.
> - *
> * Notice that swapoff or swapoff+swapon can still happen before the
> - * rcu_read_lock() in get_swap_device() or after the rcu_read_unlock()
> - * in put_swap_device() if there isn't any other way to prevent
> - * swapoff, such as page lock, page table lock, etc. The caller must
> - * be prepared for that. For example, the following situation is
> - * possible.
> + * percpu_ref_tryget_live() in get_swap_device() or after the
> + * percpu_ref_put() in put_swap_device() if there isn't any other way
> + * to prevent swapoff, such as page lock, page table lock, etc. The
> + * caller must be prepared for that. For example, the following
> + * situation is possible.
> *
> * CPU1 CPU2
> * do_swap_page()
> @@ -1319,21 +1313,21 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
> si = swp_swap_info(entry);
> if (!si)
> goto bad_nofile;
> -
> - rcu_read_lock();
> if (data_race(!(si->flags & SWP_VALID)))
We can delete SWP_VALID; it is only used together with the RCU solution.
> - goto unlock_out;
> + goto out;
> + if (!percpu_ref_tryget_live(&si->users))
> + goto out;
> offset = swp_offset(entry);
> if (offset >= si->max)
> - goto unlock_out;
> + goto put_out;
>
> return si;
> bad_nofile:
> pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
> out:
> return NULL;
> -unlock_out:
> - rcu_read_unlock();
> +put_out:
> + percpu_ref_put(&si->users);
> return NULL;
> }
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-12 1:44 ` Huang, Ying
@ 2021-04-12 3:24 ` Miaohe Lin
-1 siblings, 0 replies; 24+ messages in thread
From: Miaohe Lin @ 2021-04-12 3:24 UTC (permalink / raw)
To: Huang, Ying
Cc: Tim Chen, akpm, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi,
willy, minchan, richard.weiyang, hughd, linux-kernel, linux-mm
On 2021/4/12 9:44, Huang, Ying wrote:
> Miaohe Lin <linmiaohe@huawei.com> writes:
>
>> On 2021/4/10 1:17, Tim Chen wrote:
>>>
>>>
>>> On 4/9/21 1:42 AM, Miaohe Lin wrote:
>>>> On 2021/4/9 5:34, Tim Chen wrote:
>>>>>
>>>>>
>>>>> On 4/8/21 6:08 AM, Miaohe Lin wrote:
>>>>>> When I was investigating the swap code, I found the below possible race
>>>>>> window:
>>>>>>
>>>>>> CPU 1 CPU 2
>>>>>> ----- -----
>>>>>> do_swap_page
>>>>>> synchronous swap_readpage
>>>>>> alloc_page_vma
>>>>>> swapoff
>>>>>> release swap_file, bdev, or ...
>>>>>
>>>>
>>>> Many thanks for quick review and reply!
>>>>
>>>>> Perhaps I'm missing something. The release of swap_file, bdev etc
>>>>> happens after we have cleared the SWP_VALID bit in si->flags in destroy_swap_extents
>>>>> if I read the swapoff code correctly.
>>>> Agree. Let's look at this more closely:
>>>> CPU1 CPU2
>>>> ----- -----
>>>> swap_readpage
>>>> if (data_race(sis->flags & SWP_FS_OPS)) {
>>>> swapoff
>>>> p->swap_file = NULL;
>>>> struct file *swap_file = sis->swap_file;
>>>> struct address_space *mapping = swap_file->f_mapping;[oops!]
>>>> ...
>>>> p->flags = 0;
>>>> ...
>>>>
>>>> Does this make sense to you?
>>>
>>> p->swapfile = NULL happens after the
>>> p->flags &= ~SWP_VALID, synchronize_rcu(), destroy_swap_extents() sequence in swapoff().
>>>
>>> So I don't think the sequence you illustrated on CPU2 is in the right order.
>>> That said, without get_swap_device/put_swap_device in swap_readpage, you could
>>> potentially blow past synchronize_rcu() on CPU2 and cause a problem. So I think
>>> the problematic race looks something like the following:
>>>
>>>
>>> CPU1 CPU2
>>> ----- -----
>>> swap_readpage
>>> if (data_race(sis->flags & SWP_FS_OPS)) {
>>> swapoff
>>> p->flags = &= ~SWP_VALID;
>>> ..
>>> synchronize_rcu();
>>> ..
>>> p->swap_file = NULL;
>>> struct file *swap_file = sis->swap_file;
>>> struct address_space *mapping = swap_file->f_mapping;[oops!]
>>> ...
>>> ...
>>>
>>
>> Agree. This is also what I meant to illustrate. And you provided a better one. Many thanks!
>
> For pages that are swapped in through the swap cache, that isn't an
> issue. Because the page is locked, the swap entry will be marked with
> SWAP_HAS_CACHE, so swapoff() cannot proceed until the page has been
> unlocked.
>
> So the race is for the fast path as follows,
>
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> __swap_count(entry) == 1)
>
> I found it in your original patch description. But please make it more
> explicit to reduce potential confusion.
Sure. Should I rephrase the commit log to clarify this or add a comment in the code?
Thanks.
>
> Best Regards,
> Huang, Ying
> .
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-10 3:17 ` Miaohe Lin
@ 2021-04-12 1:44 ` Huang, Ying
0 siblings, 0 replies; 24+ messages in thread
From: Huang, Ying @ 2021-04-12 1:44 UTC (permalink / raw)
To: Miaohe Lin
Cc: Tim Chen, akpm, hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi,
willy, minchan, richard.weiyang, hughd, linux-kernel, linux-mm
Miaohe Lin <linmiaohe@huawei.com> writes:
> On 2021/4/10 1:17, Tim Chen wrote:
>>
>>
>> On 4/9/21 1:42 AM, Miaohe Lin wrote:
>>> On 2021/4/9 5:34, Tim Chen wrote:
>>>>
>>>>
>>>> On 4/8/21 6:08 AM, Miaohe Lin wrote:
>>>>> When I was investigating the swap code, I found the below possible race
>>>>> window:
>>>>>
>>>>> CPU 1 CPU 2
>>>>> ----- -----
>>>>> do_swap_page
>>>>> synchronous swap_readpage
>>>>> alloc_page_vma
>>>>> swapoff
>>>>> release swap_file, bdev, or ...
>>>>
>>>
>>> Many thanks for quick review and reply!
>>>
>>>> Perhaps I'm missing something. The release of swap_file, bdev etc
>>>> happens after we have cleared the SWP_VALID bit in si->flags in destroy_swap_extents
>>>> if I read the swapoff code correctly.
>> Agree. Let's look at this more closely:
>>> CPU1 CPU2
>>> ----- -----
>>> swap_readpage
>>> if (data_race(sis->flags & SWP_FS_OPS)) {
>>> swapoff
>>> p->swap_file = NULL;
>>> struct file *swap_file = sis->swap_file;
>>> struct address_space *mapping = swap_file->f_mapping;[oops!]
>>> ...
>>> p->flags = 0;
>>> ...
>>>
>> Does this make sense to you?
>>
>> p->swapfile = NULL happens after the
>> p->flags &= ~SWP_VALID, synchronize_rcu(), destroy_swap_extents() sequence in swapoff().
>>
>> So I don't think the sequence you illustrated on CPU2 is in the right order.
>> That said, without get_swap_device/put_swap_device in swap_readpage, you could
>> potentially blow past synchronize_rcu() on CPU2 and cause a problem. So I think
>> the problematic race looks something like the following:
>>
>>
>> CPU1 CPU2
>> ----- -----
>> swap_readpage
>> if (data_race(sis->flags & SWP_FS_OPS)) {
>> swapoff
>> p->flags = &= ~SWP_VALID;
>> ..
>> synchronize_rcu();
>> ..
>> p->swap_file = NULL;
>> struct file *swap_file = sis->swap_file;
>> struct address_space *mapping = swap_file->f_mapping;[oops!]
>> ...
>> ...
>>
>
> Agree. This is also what I meant to illustrate. And you provided a better one. Many thanks!
For pages that are swapped in through the swap cache, that isn't an
issue. Because the page is locked, the swap entry will be marked with
SWAP_HAS_CACHE, so swapoff() cannot proceed until the page has been
unlocked.
So the race is for the fast path as follows,
if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1)
I found it in your original patch description. But please make it more
explicit to reduce potential confusion.
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-09 17:17 ` Tim Chen
@ 2021-04-10 3:17 ` Miaohe Lin
2021-04-12 1:44 ` Huang, Ying
0 siblings, 1 reply; 24+ messages in thread
From: Miaohe Lin @ 2021-04-10 3:17 UTC (permalink / raw)
To: Tim Chen, akpm
Cc: hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, willy, minchan,
richard.weiyang, ying.huang, hughd, linux-kernel, linux-mm
On 2021/4/10 1:17, Tim Chen wrote:
>
>
> On 4/9/21 1:42 AM, Miaohe Lin wrote:
>> On 2021/4/9 5:34, Tim Chen wrote:
>>>
>>>
>>> On 4/8/21 6:08 AM, Miaohe Lin wrote:
>>>> When I was investigating the swap code, I found the below possible race
>>>> window:
>>>>
>>>> CPU 1 CPU 2
>>>> ----- -----
>>>> do_swap_page
>>>> synchronous swap_readpage
>>>> alloc_page_vma
>>>> swapoff
>>>> release swap_file, bdev, or ...
>>>
>>
>> Many thanks for quick review and reply!
>>
>>> Perhaps I'm missing something. The release of swap_file, bdev etc
>>> happens after we have cleared the SWP_VALID bit in si->flags in destroy_swap_extents
>>> if I read the swapoff code correctly.
>> Agree. Let's look at this more closely:
>> CPU1 CPU2
>> ----- -----
>> swap_readpage
>> if (data_race(sis->flags & SWP_FS_OPS)) {
>> swapoff
>> p->swap_file = NULL;
>> struct file *swap_file = sis->swap_file;
>> struct address_space *mapping = swap_file->f_mapping;[oops!]
>> ...
>> p->flags = 0;
>> ...
>>
>> Does this make sense to you?
>
> p->swapfile = NULL happens after the
> p->flags &= ~SWP_VALID, synchronize_rcu(), destroy_swap_extents() sequence in swapoff().
>
> So I don't think the sequence you illustrated on CPU2 is in the right order.
> That said, without get_swap_device/put_swap_device in swap_readpage, you could
> potentially blow past synchronize_rcu() on CPU2 and cause a problem. So I think
> the problematic race looks something like the following:
>
>
> CPU1 CPU2
> ----- -----
> swap_readpage
> if (data_race(sis->flags & SWP_FS_OPS)) {
> swapoff
> p->flags = &= ~SWP_VALID;
> ..
> synchronize_rcu();
> ..
> p->swap_file = NULL;
> struct file *swap_file = sis->swap_file;
> struct address_space *mapping = swap_file->f_mapping;[oops!]
> ...
> ...
>
Agree. This is also what I meant to illustrate. And you provided a better one. Many thanks!
> By adding get_swap_device/put_swap_device, then the race is fixed.
>
>
> CPU1 CPU2
> ----- -----
> swap_readpage
> get_swap_device()
> ..
> if (data_race(sis->flags & SWP_FS_OPS)) {
> swapoff
> p->flags = &= ~SWP_VALID;
> ..
> struct file *swap_file = sis->swap_file;
> struct address_space *mapping = swap_file->f_mapping;[valid value]
> ..
> put_swap_device()
> synchronize_rcu();
> ..
> p->swap_file = NULL;
>
>
>>
>>>>
>>>> swap_readpage
>>>> check sis->flags is ok
>>>> access swap_file, bdev...[oops!]
>>>> si->flags = 0
>>>
>>> This happens after we clear the si->flags
>>> synchronize_rcu()
>>> release swap_file, bdev, in destroy_swap_extents()
>>>
>>> So I think if we have get_swap_device/put_swap_device in do_swap_page,
>>> it should fix the race you've pointed out here.
>>> Then synchronize_rcu() will wait till we have completed do_swap_page and
>>> call put_swap_device.
>>
>> Right, get_swap_device/put_swap_device could fix this race. __But__ rcu_read_lock()
>> in get_swap_device() disables preemption, and do_swap_page() may take a really long
>> time because it involves I/O. It may not be acceptable to disable preemption for
>> such a long time. :(
>
> I can see that it is not a good idea to hold the RCU read lock for a long
> time over a slow file I/O operation, which will be the side effect of
> introducing get/put_swap_device to swap_readpage. So using percpu_ref
> will then be preferable for synchronization once we introduce
> get/put_swap_device into swap_readpage.
>
The sis->bdev should also be protected by get/put_swap_device(); it has a similar
issue. And swap_slot_free_notify() (called from the end_swap_bio_read callback) would
race with swapoff too. So I use get/put_swap_device() to protect swap_readpage() until
the file I/O operation is completed.
Thanks again!
> Tim
> .
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-09 8:42 ` Miaohe Lin
@ 2021-04-09 17:17 ` Tim Chen
2021-04-10 3:17 ` Miaohe Lin
0 siblings, 1 reply; 24+ messages in thread
From: Tim Chen @ 2021-04-09 17:17 UTC (permalink / raw)
To: Miaohe Lin, akpm
Cc: hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, willy, minchan,
richard.weiyang, ying.huang, hughd, linux-kernel, linux-mm
On 4/9/21 1:42 AM, Miaohe Lin wrote:
> On 2021/4/9 5:34, Tim Chen wrote:
>>
>>
>> On 4/8/21 6:08 AM, Miaohe Lin wrote:
>>> When I was investigating the swap code, I found the below possible race
>>> window:
>>>
>>> CPU 1 CPU 2
>>> ----- -----
>>> do_swap_page
>>> synchronous swap_readpage
>>> alloc_page_vma
>>> swapoff
>>> release swap_file, bdev, or ...
>>
>
> Many thanks for quick review and reply!
>
>> Perhaps I'm missing something. The release of swap_file, bdev etc
>> happens after we have cleared the SWP_VALID bit in si->flags in destroy_swap_extents
>> if I read the swapoff code correctly.
> Agree. Let's look at this more closely:
> CPU1 CPU2
> ----- -----
> swap_readpage
> if (data_race(sis->flags & SWP_FS_OPS)) {
> swapoff
> p->swap_file = NULL;
> struct file *swap_file = sis->swap_file;
> struct address_space *mapping = swap_file->f_mapping;[oops!]
> ...
> p->flags = 0;
> ...
>
> Does this make sense to you?
p->swap_file = NULL happens after the
p->flags &= ~SWP_VALID, synchronize_rcu(), destroy_swap_extents() sequence in swapoff().
So I don't think the sequence you illustrated on CPU2 is in the right order.
That said, without get_swap_device/put_swap_device in swap_readpage, you could
potentially blow past synchronize_rcu() on CPU2 and cause a problem. So I think
the problematic race looks something like the following:
CPU1 CPU2
----- -----
swap_readpage
if (data_race(sis->flags & SWP_FS_OPS)) {
swapoff
p->flags &= ~SWP_VALID;
..
synchronize_rcu();
..
p->swap_file = NULL;
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;[oops!]
...
...
By adding get_swap_device/put_swap_device, then the race is fixed.
CPU1 CPU2
----- -----
swap_readpage
get_swap_device()
..
if (data_race(sis->flags & SWP_FS_OPS)) {
swapoff
p->flags &= ~SWP_VALID;
..
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;[valid value]
..
put_swap_device()
synchronize_rcu();
..
p->swap_file = NULL;
>
>>>
>>> swap_readpage
>>> check sis->flags is ok
>>> access swap_file, bdev...[oops!]
>>> si->flags = 0
>>
>> This happens after we clear the si->flags
>> synchronize_rcu()
>> release swap_file, bdev, in destroy_swap_extents()
>>
>> So I think if we have get_swap_device/put_swap_device in do_swap_page,
>> it should fix the race you've pointed out here.
>> Then synchronize_rcu() will wait till we have completed do_swap_page and
>> call put_swap_device.
>
> Right, get_swap_device/put_swap_device could fix this race. __But__ rcu_read_lock()
> in get_swap_device() could disable preemption, and do_swap_page() may take a really long
> time because it involves I/O. It may not be acceptable to disable preemption for such a
> long time. :(
I can see that it is not a good idea to hold the RCU read lock for a long
time over a slow file I/O operation, which would be the side effect of
introducing get/put_swap_device into swap_readpage. So using percpu_ref
for synchronization will be preferable once we introduce
get/put_swap_device into swap_readpage.
Tim
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-08 21:37 ` kernel test robot
@ 2021-04-09 8:46 ` Miaohe Lin
0 siblings, 0 replies; 24+ messages in thread
From: Miaohe Lin @ 2021-04-09 8:46 UTC (permalink / raw)
To: kbuild-all
[-- Attachment #1: Type: text/plain, Size: 12132 bytes --]
On 2021/4/9 5:37, kernel test robot wrote:
> Hi Miaohe,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on linux/master]
> [also build test ERROR on linus/master hnaz-linux-mm/master v5.12-rc6 next-20210408]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
>
> url: https://github.com/0day-ci/linux/commits/Miaohe-Lin/close-various-race-windows-for-swap/20210408-211224
> base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 5e46d1b78a03d52306f21f77a4e4a144b6d31486
> config: x86_64-randconfig-a012-20210408 (attached as .config)
> compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 56ea2e2fdd691136d5e6631fa0e447173694b82c)
> reproduce (this is a W=1 build):
> wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # install x86_64 cross compiling tool for clang build
> # apt-get install binutils-x86-64-linux-gnu
> # https://github.com/0day-ci/linux/commit/56e65e21c8c9858e36c3bca84006a15fe9b85efd
> git remote add linux-review https://github.com/0day-ci/linux
> git fetch --no-tags linux-review Miaohe-Lin/close-various-race-windows-for-swap/20210408-211224
> git checkout 56e65e21c8c9858e36c3bca84006a15fe9b85efd
> # save the attached .config to linux build tree
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
>
> All error/warnings (new ones prefixed by >>):
>
>>> mm/memory.c:3300:7: error: implicit declaration of function 'get_swap_device' [-Werror,-Wimplicit-function-declaration]
> si = get_swap_device(entry);
> ^
> mm/memory.c:3300:7: note: did you mean 'get_cpu_device'?
> include/linux/cpu.h:38:23: note: 'get_cpu_device' declared here
> extern struct device *get_cpu_device(unsigned cpu);
> ^
>>> mm/memory.c:3300:5: warning: incompatible integer to pointer conversion assigning to 'struct swap_info_struct *' from 'int' [-Wint-conversion]
> si = get_swap_device(entry);
> ^ ~~~~~~~~~~~~~~~~~~~~~~
>>> mm/memory.c:3483:3: error: implicit declaration of function 'put_swap_device' [-Werror,-Wimplicit-function-declaration]
> put_swap_device(si);
> ^
> mm/memory.c:3483:3: note: did you mean 'get_swap_device'?
> mm/memory.c:3300:7: note: 'get_swap_device' declared here
> si = get_swap_device(entry);
> ^
> 1 warning and 2 errors generated.
>
Many thanks. Will fix it.
>
> vim +/get_swap_device +3300 mm/memory.c
>
> 3258
> 3259 /*
> 3260 * We enter with non-exclusive mmap_lock (to exclude vma changes,
> 3261 * but allow concurrent faults), and pte mapped but not yet locked.
> 3262 * We return with pte unmapped and unlocked.
> 3263 *
> 3264 * We return with the mmap_lock locked or unlocked in the same cases
> 3265 * as does filemap_fault().
> 3266 */
> 3267 vm_fault_t do_swap_page(struct vm_fault *vmf)
> 3268 {
> 3269 struct vm_area_struct *vma = vmf->vma;
> 3270 struct page *page = NULL, *swapcache;
> 3271 struct swap_info_struct *si = NULL;
> 3272 swp_entry_t entry;
> 3273 pte_t pte;
> 3274 int locked;
> 3275 int exclusive = 0;
> 3276 vm_fault_t ret = 0;
> 3277 void *shadow = NULL;
> 3278
> 3279 if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
> 3280 goto out;
> 3281
> 3282 entry = pte_to_swp_entry(vmf->orig_pte);
> 3283 if (unlikely(non_swap_entry(entry))) {
> 3284 if (is_migration_entry(entry)) {
> 3285 migration_entry_wait(vma->vm_mm, vmf->pmd,
> 3286 vmf->address);
> 3287 } else if (is_device_private_entry(entry)) {
> 3288 vmf->page = device_private_entry_to_page(entry);
> 3289 ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> 3290 } else if (is_hwpoison_entry(entry)) {
> 3291 ret = VM_FAULT_HWPOISON;
> 3292 } else {
> 3293 print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
> 3294 ret = VM_FAULT_SIGBUS;
> 3295 }
> 3296 goto out;
> 3297 }
> 3298
> 3299
>> 3300 si = get_swap_device(entry);
> 3301 /* In case we raced with swapoff. */
> 3302 if (unlikely(!si))
> 3303 goto out;
> 3304
> 3305 delayacct_set_flag(DELAYACCT_PF_SWAPIN);
> 3306 page = lookup_swap_cache(entry, vma, vmf->address);
> 3307 swapcache = page;
> 3308
> 3309 if (!page) {
> 3310 struct swap_info_struct *si = swp_swap_info(entry);
> 3311
> 3312 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> 3313 __swap_count(entry) == 1) {
> 3314 /* skip swapcache */
> 3315 page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
> 3316 vmf->address);
> 3317 if (page) {
> 3318 int err;
> 3319
> 3320 __SetPageLocked(page);
> 3321 __SetPageSwapBacked(page);
> 3322 set_page_private(page, entry.val);
> 3323
> 3324 /* Tell memcg to use swap ownership records */
> 3325 SetPageSwapCache(page);
> 3326 err = mem_cgroup_charge(page, vma->vm_mm,
> 3327 GFP_KERNEL);
> 3328 ClearPageSwapCache(page);
> 3329 if (err) {
> 3330 ret = VM_FAULT_OOM;
> 3331 goto out_page;
> 3332 }
> 3333
> 3334 shadow = get_shadow_from_swap_cache(entry);
> 3335 if (shadow)
> 3336 workingset_refault(page, shadow);
> 3337
> 3338 lru_cache_add(page);
> 3339 swap_readpage(page, true);
> 3340 }
> 3341 } else {
> 3342 page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> 3343 vmf);
> 3344 swapcache = page;
> 3345 }
> 3346
> 3347 if (!page) {
> 3348 /*
> 3349 * Back out if somebody else faulted in this pte
> 3350 * while we released the pte lock.
> 3351 */
> 3352 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> 3353 vmf->address, &vmf->ptl);
> 3354 if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
> 3355 ret = VM_FAULT_OOM;
> 3356 delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
> 3357 goto unlock;
> 3358 }
> 3359
> 3360 /* Had to read the page from swap area: Major fault */
> 3361 ret = VM_FAULT_MAJOR;
> 3362 count_vm_event(PGMAJFAULT);
> 3363 count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> 3364 } else if (PageHWPoison(page)) {
> 3365 /*
> 3366 * hwpoisoned dirty swapcache pages are kept for killing
> 3367 * owner processes (which may be unknown at hwpoison time)
> 3368 */
> 3369 ret = VM_FAULT_HWPOISON;
> 3370 delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
> 3371 goto out_release;
> 3372 }
> 3373
> 3374 locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
> 3375
> 3376 delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
> 3377 if (!locked) {
> 3378 ret |= VM_FAULT_RETRY;
> 3379 goto out_release;
> 3380 }
> 3381
> 3382 /*
> 3383 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
> 3384 * release the swapcache from under us. The page pin, and pte_same
> 3385 * test below, are not enough to exclude that. Even if it is still
> 3386 * swapcache, we need to check that the page's swap has not changed.
> 3387 */
> 3388 if (unlikely((!PageSwapCache(page) ||
> 3389 page_private(page) != entry.val)) && swapcache)
> 3390 goto out_page;
> 3391
> 3392 page = ksm_might_need_to_copy(page, vma, vmf->address);
> 3393 if (unlikely(!page)) {
> 3394 ret = VM_FAULT_OOM;
> 3395 page = swapcache;
> 3396 goto out_page;
> 3397 }
> 3398
> 3399 cgroup_throttle_swaprate(page, GFP_KERNEL);
> 3400
> 3401 /*
> 3402 * Back out if somebody else already faulted in this pte.
> 3403 */
> 3404 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> 3405 &vmf->ptl);
> 3406 if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
> 3407 goto out_nomap;
> 3408
> 3409 if (unlikely(!PageUptodate(page))) {
> 3410 ret = VM_FAULT_SIGBUS;
> 3411 goto out_nomap;
> 3412 }
> 3413
> 3414 /*
> 3415 * The page isn't present yet, go ahead with the fault.
> 3416 *
> 3417 * Be careful about the sequence of operations here.
> 3418 * To get its accounting right, reuse_swap_page() must be called
> 3419 * while the page is counted on swap but not yet in mapcount i.e.
> 3420 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
> 3421 * must be called after the swap_free(), or it will never succeed.
> 3422 */
> 3423
> 3424 inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
> 3425 dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
> 3426 pte = mk_pte(page, vma->vm_page_prot);
> 3427 if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
> 3428 pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> 3429 vmf->flags &= ~FAULT_FLAG_WRITE;
> 3430 ret |= VM_FAULT_WRITE;
> 3431 exclusive = RMAP_EXCLUSIVE;
> 3432 }
> 3433 flush_icache_page(vma, page);
> 3434 if (pte_swp_soft_dirty(vmf->orig_pte))
> 3435 pte = pte_mksoft_dirty(pte);
> 3436 if (pte_swp_uffd_wp(vmf->orig_pte)) {
> 3437 pte = pte_mkuffd_wp(pte);
> 3438 pte = pte_wrprotect(pte);
> 3439 }
> 3440 set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> 3441 arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
> 3442 vmf->orig_pte = pte;
> 3443
> 3444 /* ksm created a completely new copy */
> 3445 if (unlikely(page != swapcache && swapcache)) {
> 3446 page_add_new_anon_rmap(page, vma, vmf->address, false);
> 3447 lru_cache_add_inactive_or_unevictable(page, vma);
> 3448 } else {
> 3449 do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
> 3450 }
> 3451
> 3452 swap_free(entry);
> 3453 if (mem_cgroup_swap_full(page) ||
> 3454 (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> 3455 try_to_free_swap(page);
> 3456 unlock_page(page);
> 3457 if (page != swapcache && swapcache) {
> 3458 /*
> 3459 * Hold the lock to avoid the swap entry to be reused
> 3460 * until we take the PT lock for the pte_same() check
> 3461 * (to avoid false positives from pte_same). For
> 3462 * further safety release the lock after the swap_free
> 3463 * so that the swap count won't change under a
> 3464 * parallel locked swapcache.
> 3465 */
> 3466 unlock_page(swapcache);
> 3467 put_page(swapcache);
> 3468 }
> 3469
> 3470 if (vmf->flags & FAULT_FLAG_WRITE) {
> 3471 ret |= do_wp_page(vmf);
> 3472 if (ret & VM_FAULT_ERROR)
> 3473 ret &= VM_FAULT_ERROR;
> 3474 goto out;
> 3475 }
> 3476
> 3477 /* No need to invalidate - it was non-present before */
> 3478 update_mmu_cache(vma, vmf->address, vmf->pte);
> 3479 unlock:
> 3480 pte_unmap_unlock(vmf->pte, vmf->ptl);
> 3481 out:
> 3482 if (si)
>> 3483 put_swap_device(si);
> 3484 return ret;
> 3485 out_nomap:
> 3486 pte_unmap_unlock(vmf->pte, vmf->ptl);
> 3487 out_page:
> 3488 unlock_page(page);
> 3489 out_release:
> 3490 put_page(page);
> 3491 if (page != swapcache && swapcache) {
> 3492 unlock_page(swapcache);
> 3493 put_page(swapcache);
> 3494 }
> 3495 if (si)
> 3496 put_swap_device(si);
> 3497 return ret;
> 3498 }
> 3499
>
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-08 21:34 ` Tim Chen
@ 2021-04-09 8:42 ` Miaohe Lin
2021-04-09 17:17 ` Tim Chen
0 siblings, 1 reply; 24+ messages in thread
From: Miaohe Lin @ 2021-04-09 8:42 UTC (permalink / raw)
To: Tim Chen, akpm
Cc: hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, willy, minchan,
richard.weiyang, ying.huang, hughd, linux-kernel, linux-mm
On 2021/4/9 5:34, Tim Chen wrote:
>
>
> On 4/8/21 6:08 AM, Miaohe Lin wrote:
>> When I was investigating the swap code, I found the below possible race
>> window:
>>
>> CPU 1 CPU 2
>> ----- -----
>> do_swap_page
>> synchronous swap_readpage
>> alloc_page_vma
>> swapoff
>> release swap_file, bdev, or ...
>
Many thanks for quick review and reply!
> Perhaps I'm missing something. The release of swap_file, bdev etc
> happens after we have cleared the SWP_VALID bit in si->flags in destroy_swap_extents
> if I read the swapoff code correctly.
Agree. Let's look at this more closely:
CPU1 CPU2
----- -----
swap_readpage
if (data_race(sis->flags & SWP_FS_OPS)) {
swapoff
p->swap_file = NULL;
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;[oops!]
...
p->flags = 0;
...
Does this make sense to you?
> >
>> swap_readpage
>> check sis->flags is ok
>> access swap_file, bdev...[oops!]
>> si->flags = 0
>
> This happens after we clear the si->flags
> synchronize_rcu()
> release swap_file, bdev, in destroy_swap_extents()
>
> So I think if we have get_swap_device/put_swap_device in do_swap_page,
> it should fix the race you've pointed out here.
> Then synchronize_rcu() will wait till we have completed do_swap_page and
> call put_swap_device.
Right, get_swap_device/put_swap_device could fix this race. __But__ rcu_read_lock()
in get_swap_device() could disable preemption, and do_swap_page() may take a really long
time because it involves I/O. It may not be acceptable to disable preemption for such a
long time. :(
>
>>
>> Using the current get/put_swap_device() to guard against concurrent swapoff for
>> swap_readpage() looks terrible because swap_readpage() may take a really long
>> time. And this race may not be really pernicious because swapoff is usually
>> done only at system shutdown. To reduce the performance overhead on the
>> hot path as much as possible, it appears we can use percpu_ref to close
>> this race window (as suggested by Huang, Ying).
>
> I think it is better to break this patch into two.
> One patch is to fix the race in do_swap_page and swapoff
> by adding get_swap_device/put_swap_device in do_swap_page.
>
> The second patch is to modify get_swap_device and put_swap_device
> with percpu_ref. But swapoff is a relatively rare events.
Sounds reasonable. Will do it.
>
> I am not sure making percpu_ref change for performance is really beneficial.
> Did you encounter a real use case where you see a problem with swapoff?
> The delay in swapoff is primarily in try_to_unuse to bring all
> the swapped off pages back into memory. Synchronizing with other
> CPU for paging in probably is a small component in overall scheme
> of things.
>
I can't find a simpler and more stable way to fix this potential and *theoretical* issue.
It could happen in the real world, but the race window should be very small. Since swapoff
is usually done only at system shutdown, I'm not really sure whether this effort is worthwhile.
But IMO, we should eliminate any potential trouble. :)
> Thanks.
>
Thanks again.
> Tim
>
> .
>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-08 13:08 ` [PATCH 2/5] swap: fix do_swap_page() race with swapoff Miaohe Lin
2021-04-08 21:34 ` Tim Chen
2021-04-08 21:37 ` kernel test robot
@ 2021-04-08 22:56 ` kernel test robot
2021-04-13 1:27 ` Huang, Ying
3 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2021-04-08 22:56 UTC (permalink / raw)
To: kbuild-all
[-- Attachment #1: Type: text/plain, Size: 11028 bytes --]
Hi Miaohe,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on linux/master]
[also build test ERROR on linus/master hnaz-linux-mm/master v5.12-rc6 next-20210408]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Miaohe-Lin/close-various-race-windows-for-swap/20210408-211224
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 5e46d1b78a03d52306f21f77a4e4a144b6d31486
config: mips-randconfig-r016-20210408 (attached as .config)
compiler: mipsel-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/56e65e21c8c9858e36c3bca84006a15fe9b85efd
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Miaohe-Lin/close-various-race-windows-for-swap/20210408-211224
git checkout 56e65e21c8c9858e36c3bca84006a15fe9b85efd
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=mips
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All error/warnings (new ones prefixed by >>):
mm/memory.c: In function 'do_swap_page':
>> mm/memory.c:3300:7: error: implicit declaration of function 'get_swap_device'; did you mean 'get_cpu_device'? [-Werror=implicit-function-declaration]
3300 | si = get_swap_device(entry);
| ^~~~~~~~~~~~~~~
| get_cpu_device
>> mm/memory.c:3300:5: warning: assignment to 'struct swap_info_struct *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
3300 | si = get_swap_device(entry);
| ^
>> mm/memory.c:3483:3: error: implicit declaration of function 'put_swap_device'; did you mean 'put_swap_page'? [-Werror=implicit-function-declaration]
3483 | put_swap_device(si);
| ^~~~~~~~~~~~~~~
| put_swap_page
cc1: some warnings being treated as errors
vim +3300 mm/memory.c
3258
3259 /*
3260 * We enter with non-exclusive mmap_lock (to exclude vma changes,
3261 * but allow concurrent faults), and pte mapped but not yet locked.
3262 * We return with pte unmapped and unlocked.
3263 *
3264 * We return with the mmap_lock locked or unlocked in the same cases
3265 * as does filemap_fault().
3266 */
3267 vm_fault_t do_swap_page(struct vm_fault *vmf)
3268 {
3269 struct vm_area_struct *vma = vmf->vma;
3270 struct page *page = NULL, *swapcache;
3271 struct swap_info_struct *si = NULL;
3272 swp_entry_t entry;
3273 pte_t pte;
3274 int locked;
3275 int exclusive = 0;
3276 vm_fault_t ret = 0;
3277 void *shadow = NULL;
3278
3279 if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
3280 goto out;
3281
3282 entry = pte_to_swp_entry(vmf->orig_pte);
3283 if (unlikely(non_swap_entry(entry))) {
3284 if (is_migration_entry(entry)) {
3285 migration_entry_wait(vma->vm_mm, vmf->pmd,
3286 vmf->address);
3287 } else if (is_device_private_entry(entry)) {
3288 vmf->page = device_private_entry_to_page(entry);
3289 ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
3290 } else if (is_hwpoison_entry(entry)) {
3291 ret = VM_FAULT_HWPOISON;
3292 } else {
3293 print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
3294 ret = VM_FAULT_SIGBUS;
3295 }
3296 goto out;
3297 }
3298
3299
> 3300 si = get_swap_device(entry);
3301 /* In case we raced with swapoff. */
3302 if (unlikely(!si))
3303 goto out;
3304
3305 delayacct_set_flag(DELAYACCT_PF_SWAPIN);
3306 page = lookup_swap_cache(entry, vma, vmf->address);
3307 swapcache = page;
3308
3309 if (!page) {
3310 struct swap_info_struct *si = swp_swap_info(entry);
3311
3312 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
3313 __swap_count(entry) == 1) {
3314 /* skip swapcache */
3315 page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
3316 vmf->address);
3317 if (page) {
3318 int err;
3319
3320 __SetPageLocked(page);
3321 __SetPageSwapBacked(page);
3322 set_page_private(page, entry.val);
3323
3324 /* Tell memcg to use swap ownership records */
3325 SetPageSwapCache(page);
3326 err = mem_cgroup_charge(page, vma->vm_mm,
3327 GFP_KERNEL);
3328 ClearPageSwapCache(page);
3329 if (err) {
3330 ret = VM_FAULT_OOM;
3331 goto out_page;
3332 }
3333
3334 shadow = get_shadow_from_swap_cache(entry);
3335 if (shadow)
3336 workingset_refault(page, shadow);
3337
3338 lru_cache_add(page);
3339 swap_readpage(page, true);
3340 }
3341 } else {
3342 page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
3343 vmf);
3344 swapcache = page;
3345 }
3346
3347 if (!page) {
3348 /*
3349 * Back out if somebody else faulted in this pte
3350 * while we released the pte lock.
3351 */
3352 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
3353 vmf->address, &vmf->ptl);
3354 if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
3355 ret = VM_FAULT_OOM;
3356 delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
3357 goto unlock;
3358 }
3359
3360 /* Had to read the page from swap area: Major fault */
3361 ret = VM_FAULT_MAJOR;
3362 count_vm_event(PGMAJFAULT);
3363 count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
3364 } else if (PageHWPoison(page)) {
3365 /*
3366 * hwpoisoned dirty swapcache pages are kept for killing
3367 * owner processes (which may be unknown at hwpoison time)
3368 */
3369 ret = VM_FAULT_HWPOISON;
3370 delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
3371 goto out_release;
3372 }
3373
3374 locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
3375
3376 delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
3377 if (!locked) {
3378 ret |= VM_FAULT_RETRY;
3379 goto out_release;
3380 }
3381
3382 /*
3383 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
3384 * release the swapcache from under us. The page pin, and pte_same
3385 * test below, are not enough to exclude that. Even if it is still
3386 * swapcache, we need to check that the page's swap has not changed.
3387 */
3388 if (unlikely((!PageSwapCache(page) ||
3389 page_private(page) != entry.val)) && swapcache)
3390 goto out_page;
3391
3392 page = ksm_might_need_to_copy(page, vma, vmf->address);
3393 if (unlikely(!page)) {
3394 ret = VM_FAULT_OOM;
3395 page = swapcache;
3396 goto out_page;
3397 }
3398
3399 cgroup_throttle_swaprate(page, GFP_KERNEL);
3400
3401 /*
3402 * Back out if somebody else already faulted in this pte.
3403 */
3404 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
3405 &vmf->ptl);
3406 if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
3407 goto out_nomap;
3408
3409 if (unlikely(!PageUptodate(page))) {
3410 ret = VM_FAULT_SIGBUS;
3411 goto out_nomap;
3412 }
3413
3414 /*
3415 * The page isn't present yet, go ahead with the fault.
3416 *
3417 * Be careful about the sequence of operations here.
3418 * To get its accounting right, reuse_swap_page() must be called
3419 * while the page is counted on swap but not yet in mapcount i.e.
3420 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
3421 * must be called after the swap_free(), or it will never succeed.
3422 */
3423
3424 inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
3425 dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
3426 pte = mk_pte(page, vma->vm_page_prot);
3427 if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
3428 pte = maybe_mkwrite(pte_mkdirty(pte), vma);
3429 vmf->flags &= ~FAULT_FLAG_WRITE;
3430 ret |= VM_FAULT_WRITE;
3431 exclusive = RMAP_EXCLUSIVE;
3432 }
3433 flush_icache_page(vma, page);
3434 if (pte_swp_soft_dirty(vmf->orig_pte))
3435 pte = pte_mksoft_dirty(pte);
3436 if (pte_swp_uffd_wp(vmf->orig_pte)) {
3437 pte = pte_mkuffd_wp(pte);
3438 pte = pte_wrprotect(pte);
3439 }
3440 set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
3441 arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
3442 vmf->orig_pte = pte;
3443
3444 /* ksm created a completely new copy */
3445 if (unlikely(page != swapcache && swapcache)) {
3446 page_add_new_anon_rmap(page, vma, vmf->address, false);
3447 lru_cache_add_inactive_or_unevictable(page, vma);
3448 } else {
3449 do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
3450 }
3451
3452 swap_free(entry);
3453 if (mem_cgroup_swap_full(page) ||
3454 (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
3455 try_to_free_swap(page);
3456 unlock_page(page);
3457 if (page != swapcache && swapcache) {
3458 /*
3459 * Hold the lock to avoid the swap entry to be reused
3460 * until we take the PT lock for the pte_same() check
3461 * (to avoid false positives from pte_same). For
3462 * further safety release the lock after the swap_free
3463 * so that the swap count won't change under a
3464 * parallel locked swapcache.
3465 */
3466 unlock_page(swapcache);
3467 put_page(swapcache);
3468 }
3469
3470 if (vmf->flags & FAULT_FLAG_WRITE) {
3471 ret |= do_wp_page(vmf);
3472 if (ret & VM_FAULT_ERROR)
3473 ret &= VM_FAULT_ERROR;
3474 goto out;
3475 }
3476
3477 /* No need to invalidate - it was non-present before */
3478 update_mmu_cache(vma, vmf->address, vmf->pte);
3479 unlock:
3480 pte_unmap_unlock(vmf->pte, vmf->ptl);
3481 out:
3482 if (si)
> 3483 put_swap_device(si);
3484 return ret;
3485 out_nomap:
3486 pte_unmap_unlock(vmf->pte, vmf->ptl);
3487 out_page:
3488 unlock_page(page);
3489 out_release:
3490 put_page(page);
3491 if (page != swapcache && swapcache) {
3492 unlock_page(swapcache);
3493 put_page(swapcache);
3494 }
3495 if (si)
3496 put_swap_device(si);
3497 return ret;
3498 }
3499
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org
[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 28245 bytes --]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-08 13:08 ` [PATCH 2/5] swap: fix do_swap_page() race with swapoff Miaohe Lin
2021-04-08 21:34 ` Tim Chen
@ 2021-04-08 21:37 ` kernel test robot
2021-04-09 8:46 ` Miaohe Lin
2021-04-08 22:56 ` kernel test robot
2021-04-13 1:27 ` Huang, Ying
3 siblings, 1 reply; 24+ messages in thread
From: kernel test robot @ 2021-04-08 21:37 UTC (permalink / raw)
To: kbuild-all
[-- Attachment #1: Type: text/plain, Size: 11460 bytes --]
Hi Miaohe,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on linux/master]
[also build test ERROR on linus/master hnaz-linux-mm/master v5.12-rc6 next-20210408]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Miaohe-Lin/close-various-race-windows-for-swap/20210408-211224
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 5e46d1b78a03d52306f21f77a4e4a144b6d31486
config: x86_64-randconfig-a012-20210408 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 56ea2e2fdd691136d5e6631fa0e447173694b82c)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install x86_64 cross compiling tool for clang build
# apt-get install binutils-x86-64-linux-gnu
# https://github.com/0day-ci/linux/commit/56e65e21c8c9858e36c3bca84006a15fe9b85efd
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Miaohe-Lin/close-various-race-windows-for-swap/20210408-211224
git checkout 56e65e21c8c9858e36c3bca84006a15fe9b85efd
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All error/warnings (new ones prefixed by >>):
>> mm/memory.c:3300:7: error: implicit declaration of function 'get_swap_device' [-Werror,-Wimplicit-function-declaration]
si = get_swap_device(entry);
^
mm/memory.c:3300:7: note: did you mean 'get_cpu_device'?
include/linux/cpu.h:38:23: note: 'get_cpu_device' declared here
extern struct device *get_cpu_device(unsigned cpu);
^
>> mm/memory.c:3300:5: warning: incompatible integer to pointer conversion assigning to 'struct swap_info_struct *' from 'int' [-Wint-conversion]
si = get_swap_device(entry);
^ ~~~~~~~~~~~~~~~~~~~~~~
>> mm/memory.c:3483:3: error: implicit declaration of function 'put_swap_device' [-Werror,-Wimplicit-function-declaration]
put_swap_device(si);
^
mm/memory.c:3483:3: note: did you mean 'get_swap_device'?
mm/memory.c:3300:7: note: 'get_swap_device' declared here
si = get_swap_device(entry);
^
1 warning and 2 errors generated.
vim +/get_swap_device +3300 mm/memory.c
3258
3259 /*
3260 * We enter with non-exclusive mmap_lock (to exclude vma changes,
3261 * but allow concurrent faults), and pte mapped but not yet locked.
3262 * We return with pte unmapped and unlocked.
3263 *
3264 * We return with the mmap_lock locked or unlocked in the same cases
3265 * as does filemap_fault().
3266 */
3267 vm_fault_t do_swap_page(struct vm_fault *vmf)
3268 {
3269 struct vm_area_struct *vma = vmf->vma;
3270 struct page *page = NULL, *swapcache;
3271 struct swap_info_struct *si = NULL;
3272 swp_entry_t entry;
3273 pte_t pte;
3274 int locked;
3275 int exclusive = 0;
3276 vm_fault_t ret = 0;
3277 void *shadow = NULL;
3278
3279 if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
3280 goto out;
3281
3282 entry = pte_to_swp_entry(vmf->orig_pte);
3283 if (unlikely(non_swap_entry(entry))) {
3284 if (is_migration_entry(entry)) {
3285 migration_entry_wait(vma->vm_mm, vmf->pmd,
3286 vmf->address);
3287 } else if (is_device_private_entry(entry)) {
3288 vmf->page = device_private_entry_to_page(entry);
3289 ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
3290 } else if (is_hwpoison_entry(entry)) {
3291 ret = VM_FAULT_HWPOISON;
3292 } else {
3293 print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
3294 ret = VM_FAULT_SIGBUS;
3295 }
3296 goto out;
3297 }
3298
3299
> 3300 si = get_swap_device(entry);
3301 /* In case we raced with swapoff. */
3302 if (unlikely(!si))
3303 goto out;
3304
3305 delayacct_set_flag(DELAYACCT_PF_SWAPIN);
3306 page = lookup_swap_cache(entry, vma, vmf->address);
3307 swapcache = page;
3308
3309 if (!page) {
3310 struct swap_info_struct *si = swp_swap_info(entry);
3311
3312 if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
3313 __swap_count(entry) == 1) {
3314 /* skip swapcache */
3315 page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
3316 vmf->address);
3317 if (page) {
3318 int err;
3319
3320 __SetPageLocked(page);
3321 __SetPageSwapBacked(page);
3322 set_page_private(page, entry.val);
3323
3324 /* Tell memcg to use swap ownership records */
3325 SetPageSwapCache(page);
3326 err = mem_cgroup_charge(page, vma->vm_mm,
3327 GFP_KERNEL);
3328 ClearPageSwapCache(page);
3329 if (err) {
3330 ret = VM_FAULT_OOM;
3331 goto out_page;
3332 }
3333
3334 shadow = get_shadow_from_swap_cache(entry);
3335 if (shadow)
3336 workingset_refault(page, shadow);
3337
3338 lru_cache_add(page);
3339 swap_readpage(page, true);
3340 }
3341 } else {
3342 page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
3343 vmf);
3344 swapcache = page;
3345 }
3346
3347 if (!page) {
3348 /*
3349 * Back out if somebody else faulted in this pte
3350 * while we released the pte lock.
3351 */
3352 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
3353 vmf->address, &vmf->ptl);
3354 if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
3355 ret = VM_FAULT_OOM;
3356 delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
3357 goto unlock;
3358 }
3359
3360 /* Had to read the page from swap area: Major fault */
3361 ret = VM_FAULT_MAJOR;
3362 count_vm_event(PGMAJFAULT);
3363 count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
3364 } else if (PageHWPoison(page)) {
3365 /*
3366 * hwpoisoned dirty swapcache pages are kept for killing
3367 * owner processes (which may be unknown at hwpoison time)
3368 */
3369 ret = VM_FAULT_HWPOISON;
3370 delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
3371 goto out_release;
3372 }
3373
3374 locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
3375
3376 delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
3377 if (!locked) {
3378 ret |= VM_FAULT_RETRY;
3379 goto out_release;
3380 }
3381
3382 /*
3383 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
3384 * release the swapcache from under us. The page pin, and pte_same
3385 * test below, are not enough to exclude that. Even if it is still
3386 * swapcache, we need to check that the page's swap has not changed.
3387 */
3388 if (unlikely((!PageSwapCache(page) ||
3389 page_private(page) != entry.val)) && swapcache)
3390 goto out_page;
3391
3392 page = ksm_might_need_to_copy(page, vma, vmf->address);
3393 if (unlikely(!page)) {
3394 ret = VM_FAULT_OOM;
3395 page = swapcache;
3396 goto out_page;
3397 }
3398
3399 cgroup_throttle_swaprate(page, GFP_KERNEL);
3400
3401 /*
3402 * Back out if somebody else already faulted in this pte.
3403 */
3404 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
3405 &vmf->ptl);
3406 if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
3407 goto out_nomap;
3408
3409 if (unlikely(!PageUptodate(page))) {
3410 ret = VM_FAULT_SIGBUS;
3411 goto out_nomap;
3412 }
3413
3414 /*
3415 * The page isn't present yet, go ahead with the fault.
3416 *
3417 * Be careful about the sequence of operations here.
3418 * To get its accounting right, reuse_swap_page() must be called
3419 * while the page is counted on swap but not yet in mapcount i.e.
3420 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
3421 * must be called after the swap_free(), or it will never succeed.
3422 */
3423
3424 inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
3425 dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
3426 pte = mk_pte(page, vma->vm_page_prot);
3427 if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
3428 pte = maybe_mkwrite(pte_mkdirty(pte), vma);
3429 vmf->flags &= ~FAULT_FLAG_WRITE;
3430 ret |= VM_FAULT_WRITE;
3431 exclusive = RMAP_EXCLUSIVE;
3432 }
3433 flush_icache_page(vma, page);
3434 if (pte_swp_soft_dirty(vmf->orig_pte))
3435 pte = pte_mksoft_dirty(pte);
3436 if (pte_swp_uffd_wp(vmf->orig_pte)) {
3437 pte = pte_mkuffd_wp(pte);
3438 pte = pte_wrprotect(pte);
3439 }
3440 set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
3441 arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
3442 vmf->orig_pte = pte;
3443
3444 /* ksm created a completely new copy */
3445 if (unlikely(page != swapcache && swapcache)) {
3446 page_add_new_anon_rmap(page, vma, vmf->address, false);
3447 lru_cache_add_inactive_or_unevictable(page, vma);
3448 } else {
3449 do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
3450 }
3451
3452 swap_free(entry);
3453 if (mem_cgroup_swap_full(page) ||
3454 (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
3455 try_to_free_swap(page);
3456 unlock_page(page);
3457 if (page != swapcache && swapcache) {
3458 /*
3459 * Hold the lock to avoid the swap entry to be reused
3460 * until we take the PT lock for the pte_same() check
3461 * (to avoid false positives from pte_same). For
3462 * further safety release the lock after the swap_free
3463 * so that the swap count won't change under a
3464 * parallel locked swapcache.
3465 */
3466 unlock_page(swapcache);
3467 put_page(swapcache);
3468 }
3469
3470 if (vmf->flags & FAULT_FLAG_WRITE) {
3471 ret |= do_wp_page(vmf);
3472 if (ret & VM_FAULT_ERROR)
3473 ret &= VM_FAULT_ERROR;
3474 goto out;
3475 }
3476
3477 /* No need to invalidate - it was non-present before */
3478 update_mmu_cache(vma, vmf->address, vmf->pte);
3479 unlock:
3480 pte_unmap_unlock(vmf->pte, vmf->ptl);
3481 out:
3482 if (si)
> 3483 put_swap_device(si);
3484 return ret;
3485 out_nomap:
3486 pte_unmap_unlock(vmf->pte, vmf->ptl);
3487 out_page:
3488 unlock_page(page);
3489 out_release:
3490 put_page(page);
3491 if (page != swapcache && swapcache) {
3492 unlock_page(swapcache);
3493 put_page(swapcache);
3494 }
3495 if (si)
3496 put_swap_device(si);
3497 return ret;
3498 }
3499
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org
[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 35830 bytes --]
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-08 13:08 ` [PATCH 2/5] swap: fix do_swap_page() race with swapoff Miaohe Lin
@ 2021-04-08 21:34 ` Tim Chen
2021-04-09 8:42 ` Miaohe Lin
2021-04-08 21:37 ` kernel test robot
` (2 subsequent siblings)
3 siblings, 1 reply; 24+ messages in thread
From: Tim Chen @ 2021-04-08 21:34 UTC (permalink / raw)
To: Miaohe Lin, akpm
Cc: hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, willy, minchan,
richard.weiyang, ying.huang, hughd, linux-kernel, linux-mm
On 4/8/21 6:08 AM, Miaohe Lin wrote:
> When I was investigating the swap code, I found the below possible race
> window:
>
> CPU 1                                 CPU 2
> -----                                 -----
> do_swap_page
>   synchronous swap_readpage
>     alloc_page_vma
>                                       swapoff
>                                         release swap_file, bdev, or ...

Perhaps I'm missing something. The release of swap_file, bdev, etc.
happens after we have cleared the SWP_VALID bit in si->flags in
destroy_swap_extents(), if I read the swapoff code correctly.

>     swap_readpage
>       check sis->flags is ok
>         access swap_file, bdev...[oops!]
>                                       si->flags = 0

This happens after we clear si->flags:

	synchronize_rcu()
	release swap_file, bdev, in destroy_swap_extents()
So I think if we have get_swap_device/put_swap_device in do_swap_page,
it should fix the race you've pointed out here.
Then synchronize_rcu() will wait till we have completed do_swap_page and
called put_swap_device().
>
> Using the current get/put_swap_device() to guard against concurrent swapoff
> for swap_readpage() looks terrible because swap_readpage() may take a really
> long time. And this race may not be really pernicious because swapoff is
> usually done only at system shutdown. To reduce the performance overhead on
> the hot path as much as possible, it appears we can use a percpu_ref to
> close this race window (as suggested by Huang, Ying).
I think it is better to break this patch into two.
One patch is to fix the race in do_swap_page and swapoff
by adding get_swap_device/put_swap_device in do_swap_page.
The second patch is to modify get_swap_device and put_swap_device
to use percpu_ref. But swapoff is a relatively rare event.
I am not sure the percpu_ref change is really beneficial for performance.
Did you encounter a real use case where you saw a problem with swapoff?
The delay in swapoff is primarily in try_to_unuse, bringing all
the swapped-off pages back into memory. Synchronizing with other
CPUs for paging in is probably a small component in the overall scheme
of things.
Thanks.
Tim
* [PATCH 2/5] swap: fix do_swap_page() race with swapoff
2021-04-08 13:08 [PATCH 0/5] close various race windows for swap Miaohe Lin
@ 2021-04-08 13:08 ` Miaohe Lin
2021-04-08 21:34 ` Tim Chen
` (3 more replies)
0 siblings, 4 replies; 24+ messages in thread
From: Miaohe Lin @ 2021-04-08 13:08 UTC (permalink / raw)
To: akpm
Cc: hannes, mhocko, iamjoonsoo.kim, vbabka, alex.shi, willy, minchan,
richard.weiyang, ying.huang, hughd, tim.c.chen, linux-kernel,
linux-mm, linmiaohe
When I was investigating the swap code, I found the below possible race
window:
CPU 1					CPU 2
-----					-----
do_swap_page
  synchronous swap_readpage
    alloc_page_vma
					swapoff
					  release swap_file, bdev, or ...
    swap_readpage
      check sis->flags is ok
        access swap_file, bdev...[oops!]
					si->flags = 0
Using the current get/put_swap_device() to guard against concurrent swapoff
for swap_readpage() looks terrible because swap_readpage() may take a really
long time. And this race may not be really pernicious because swapoff is
usually done only at system shutdown. To reduce the performance overhead on
the hot path as much as possible, it appears we can use a percpu_ref to
close this race window (as suggested by Huang, Ying).
Fixes: 235b62176712 ("mm/swap: add cluster lock")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
include/linux/swap.h | 2 +-
mm/memory.c | 10 ++++++++++
mm/swapfile.c | 28 +++++++++++-----------------
3 files changed, 22 insertions(+), 18 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 849ba5265c11..9066addb57fd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -513,7 +513,7 @@ sector_t swap_page_sector(struct page *page);
static inline void put_swap_device(struct swap_info_struct *si)
{
- rcu_read_unlock();
+ percpu_ref_put(&si->users);
}
#else /* CONFIG_SWAP */
diff --git a/mm/memory.c b/mm/memory.c
index cc71a445c76c..8543c47b955c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3311,6 +3311,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
struct page *page = NULL, *swapcache;
+ struct swap_info_struct *si = NULL;
swp_entry_t entry;
pte_t pte;
int locked;
@@ -3339,6 +3340,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
+ si = get_swap_device(entry);
+ /* In case we raced with swapoff. */
+ if (unlikely(!si))
+ goto out;
+
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
page = lookup_swap_cache(entry, vma, vmf->address);
swapcache = page;
@@ -3514,6 +3520,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
unlock:
pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
+ if (si)
+ put_swap_device(si);
return ret;
out_nomap:
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -3525,6 +3533,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
unlock_page(swapcache);
put_page(swapcache);
}
+ if (si)
+ put_swap_device(si);
return ret;
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 724173cd7d0c..01032c72ceae 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1280,18 +1280,12 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *p,
* via preventing the swap device from being swapoff, until
* put_swap_device() is called. Otherwise return NULL.
*
- * The entirety of the RCU read critical section must come before the
- * return from or after the call to synchronize_rcu() in
- * enable_swap_info() or swapoff(). So if "si->flags & SWP_VALID" is
- * true, the si->map, si->cluster_info, etc. must be valid in the
- * critical section.
- *
* Notice that swapoff or swapoff+swapon can still happen before the
- * rcu_read_lock() in get_swap_device() or after the rcu_read_unlock()
- * in put_swap_device() if there isn't any other way to prevent
- * swapoff, such as page lock, page table lock, etc. The caller must
- * be prepared for that. For example, the following situation is
- * possible.
+ * percpu_ref_tryget_live() in get_swap_device() or after the
+ * percpu_ref_put() in put_swap_device() if there isn't any other way
+ * to prevent swapoff, such as page lock, page table lock, etc. The
+ * caller must be prepared for that. For example, the following
+ * situation is possible.
*
* CPU1 CPU2
* do_swap_page()
@@ -1319,21 +1313,21 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
si = swp_swap_info(entry);
if (!si)
goto bad_nofile;
-
- rcu_read_lock();
if (data_race(!(si->flags & SWP_VALID)))
- goto unlock_out;
+ goto out;
+ if (!percpu_ref_tryget_live(&si->users))
+ goto out;
offset = swp_offset(entry);
if (offset >= si->max)
- goto unlock_out;
+ goto put_out;
return si;
bad_nofile:
pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
out:
return NULL;
-unlock_out:
- rcu_read_unlock();
+put_out:
+ percpu_ref_put(&si->users);
return NULL;
}
--
2.19.1
end of thread, other threads:[~2021-04-15 3:19 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-08 20:46 [PATCH 2/5] swap: fix do_swap_page() race with swapoff kernel test robot
-- strict thread matches above, loose matches on Subject: below --
2021-04-08 13:08 [PATCH 0/5] close various race windows for swap Miaohe Lin
2021-04-08 13:08 ` [PATCH 2/5] swap: fix do_swap_page() race with swapoff Miaohe Lin
2021-04-08 21:34 ` Tim Chen
2021-04-09 8:42 ` Miaohe Lin
2021-04-09 17:17 ` Tim Chen
2021-04-10 3:17 ` Miaohe Lin
2021-04-12 1:44 ` Huang, Ying
2021-04-12 3:24 ` Miaohe Lin
2021-04-08 21:37 ` kernel test robot
2021-04-09 8:46 ` Miaohe Lin
2021-04-08 22:56 ` kernel test robot
2021-04-13 1:27 ` Huang, Ying
2021-04-13 19:24 ` Tim Chen
2021-04-14 1:04 ` Huang, Ying
2021-04-14 2:20 ` Miaohe Lin
2021-04-14 16:13 ` Tim Chen
2021-04-15 3:19 ` Miaohe Lin
2021-04-14 2:55 ` Miaohe Lin
2021-04-14 3:07 ` Huang, Ying
2021-04-14 3:27 ` Miaohe Lin