[PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
@ 2019-08-30 12:43 Vinayak Menon
  2019-09-02 13:21 ` Michal Hocko
  2019-09-09 23:26 ` Minchan Kim
  0 siblings, 2 replies; 16+ messages in thread
From: Vinayak Menon @ 2019-08-30 12:43 UTC (permalink / raw)
  To: minchan, linux-mm; +Cc: Vinayak Menon

The following race is observed due to which a processes faulting
on a swap entry, finds the page neither in swapcache nor swap. This
causes zram to give a zero filled page that gets mapped to the
process, resulting in a user space crash later.

Consider parent and child processes Pa and Pb sharing the same swap
slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
Virtual address 'VA' of Pa and Pb points to the shared swap entry.

Pa                                       Pb

fault on VA                              fault on VA
do_swap_page                             do_swap_page
lookup_swap_cache fails                  lookup_swap_cache fails
                                         Pb scheduled out
swapin_readahead (deletes zram entry)
swap_free (makes swap_count 1)
                                         Pb scheduled in
                                         swap_readpage (swap_count == 1)
                                         Takes SWP_SYNCHRONOUS_IO path
                                         zram enrty absent
                                         zram gives a zero filled page

Fix this by reading the swap_count before lookup_swap_cache, which conforms
with the order in which page is added to swap cache and swap count is
decremented in do_swap_page. In the race case above, this will let Pb take
the readahead path and thus pick the proper page from swapcache.

Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
---
 mm/memory.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index e0c232f..22643aa 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2744,6 +2744,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	struct page *page = NULL, *swapcache;
 	struct mem_cgroup *memcg;
 	swp_entry_t entry;
+	struct swap_info_struct *si;
+	bool skip_swapcache = false;
 	pte_t pte;
 	int locked;
 	int exclusive = 0;
@@ -2771,15 +2773,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 
 	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
+
+	/*
+	 * lookup_swap_cache below can fail and before the SWP_SYNCHRONOUS_IO
+	 * check is made, another process can populate the swapcache, delete
+	 * the swap entry and decrement the swap count. So decide on taking
+	 * the SWP_SYNCHRONOUS_IO path before the lookup. In the event of the
+	 * race described, the victim process will find a swap_count > 1
+	 * and can then take the readahead path instead of SWP_SYNCHRONOUS_IO.
+	 */
+	si = swp_swap_info(entry);
+	if (si->flags & SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1)
+		skip_swapcache = true;
+
 	page = lookup_swap_cache(entry, vma, vmf->address);
 	swapcache = page;
 
 	if (!page) {
-		struct swap_info_struct *si = swp_swap_info(entry);
-
-		if (si->flags & SWP_SYNCHRONOUS_IO &&
-				__swap_count(entry) == 1) {
-			/* skip swapcache */
+		if (skip_swapcache) {
 			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
 							vmf->address);
 			if (page) {
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
  2019-08-30 12:43 [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path Vinayak Menon
@ 2019-09-02 13:21 ` Michal Hocko
  2019-09-03  6:13   ` Vinayak Menon
  2019-09-09 23:26 ` Minchan Kim
  1 sibling, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2019-09-02 13:21 UTC (permalink / raw)
  To: Vinayak Menon; +Cc: minchan, linux-mm

On Fri 30-08-19 18:13:31, Vinayak Menon wrote:
> The following race is observed due to which a processes faulting
> on a swap entry, finds the page neither in swapcache nor swap. This
> causes zram to give a zero filled page that gets mapped to the
> process, resulting in a user space crash later.
> 
> Consider parent and child processes Pa and Pb sharing the same swap
> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
> 
> Pa                                       Pb
> 
> fault on VA                              fault on VA
> do_swap_page                             do_swap_page
> lookup_swap_cache fails                  lookup_swap_cache fails
>                                          Pb scheduled out
> swapin_readahead (deletes zram entry)
> swap_free (makes swap_count 1)
>                                          Pb scheduled in
>                                          swap_readpage (swap_count == 1)
>                                          Takes SWP_SYNCHRONOUS_IO path
>                                          zram enrty absent
>                                          zram gives a zero filled page

This sounds like a zram issue, right? Why is a generic swap path changed
then?

> 
> Fix this by reading the swap_count before lookup_swap_cache, which conforms
> with the order in which page is added to swap cache and swap count is
> decremented in do_swap_page. In the race case above, this will let Pb take
> the readahead path and thus pick the proper page from swapcache.
> 
> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
  2019-09-02 13:21 ` Michal Hocko
@ 2019-09-03  6:13   ` Vinayak Menon
  2019-09-03 11:41     ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Vinayak Menon @ 2019-09-03  6:13 UTC (permalink / raw)
  To: Michal Hocko; +Cc: minchan, linux-mm

Hi Michal,

Thanks for reviewing this.


On 9/2/2019 6:51 PM, Michal Hocko wrote:
> On Fri 30-08-19 18:13:31, Vinayak Menon wrote:
>> The following race is observed due to which a processes faulting
>> on a swap entry, finds the page neither in swapcache nor swap. This
>> causes zram to give a zero filled page that gets mapped to the
>> process, resulting in a user space crash later.
>>
>> Consider parent and child processes Pa and Pb sharing the same swap
>> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
>> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
>>
>> Pa                                       Pb
>>
>> fault on VA                              fault on VA
>> do_swap_page                             do_swap_page
>> lookup_swap_cache fails                  lookup_swap_cache fails
>>                                          Pb scheduled out
>> swapin_readahead (deletes zram entry)
>> swap_free (makes swap_count 1)
>>                                          Pb scheduled in
>>                                          swap_readpage (swap_count == 1)
>>                                          Takes SWP_SYNCHRONOUS_IO path
>>                                          zram enrty absent
>>                                          zram gives a zero filled page
> This sounds like a zram issue, right? Why is a generic swap path changed
> then?


I think zram entry being deleted by Pa and zram giving out a zeroed page to Pb is normal.

This is because zram avoids lazy swap slot freeing by implementing gendisk->fops->swap_slot_free_notify

and swap_slot_free_notify deletes the zram entry because the page is in swapcache.

The issue is that Pb attempted a swapcache lookup before Pa brought the page to swapcache, and failed. If

Pb had taken the swapin_readahead path, __read_swap_cache_async would have performed a second lookup

and found the page in swapcache. The issue here is that due to the lookup failure and swap_count being 1,

it takes the  SWP_SYNCHRONOUS_IO path and does a direct read which is bound to fail. So it seems to me as

a problem in the way SWP_SYNCHRONOUS_IO is handled in do_swap_page, and not a problem with zram.

Any swap device that sets SWP_SYNCHRONOUS_IO and implements swap_slot_free_notify can hit this bug.

do_swap_page first brings in the page to swapcache and then decrements the swap_count, and SWP_SYNCHRONOUS_IO

code in do_swap_page performs the swapcache and swap_count checks in the same order. Due to thread preemption

described in the sequence above, it can happen that the SWP_SYNCHRONOUS_IO path fails in swapcache check, but sees

the swap_count decremented later, thus missing a valid swapcache entry.

I have not tested, but the following patch may also fix the issue.


diff --git a/include/linux/swap.h b/include/linux/swap.h
index 063c0c1..a5ca05f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -463,6 +463,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
 extern int __swap_count(swp_entry_t entry);
+extern bool __swap_has_cache(swp_entry_t entry);
 extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
@@ -589,6 +590,11 @@ static inline int __swap_count(swp_entry_t entry)
        return 0;
 }

+static bool __swap_has_cache(swp_entry_t entry)
+{
+       return 0;
+}
+
 static inline int __swp_swapcount(swp_entry_t entry)
 {
        return 0;

diff --git a/mm/memory.c b/mm/memory.c
index e0c232f..a13511f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2778,7 +2778,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                struct swap_info_struct *si = swp_swap_info(entry);

                if (si->flags & SWP_SYNCHRONOUS_IO &&
-                               __swap_count(entry) == 1) {
+                               __swap_count(entry) == 1 &&
+                               !__swap_has_cache(entry)) {
                        /* skip swapcache */
                        page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
                                                        vmf->address);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 80445f4..2a1554a8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1459,6 +1459,20 @@ int __swap_count(swp_entry_t entry)
        return count;
 }

+bool __swap_has_cache(swp_entry_t entry)
+{
+       struct swap_info_struct *si;
+       pgoff_t offset = swp_offset(entry);
+       bool has_cache  = false;
+
+       si = get_swap_device(entry);
+       if (si) {
+               has_cache = !!(si->swap_map[offset] & SWAP_HAS_CACHE);
+               put_swap_device(si);
+       }
+       return has_cache;
+}
+
 static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 {
        int count = 0;


>
>> Fix this by reading the swap_count before lookup_swap_cache, which conforms
>> with the order in which page is added to swap cache and swap count is
>> decremented in do_swap_page. In the race case above, this will let Pb take
>> the readahead path and thus pick the proper page from swapcache.
>>
>> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
  2019-09-03  6:13   ` Vinayak Menon
@ 2019-09-03 11:41     ` Michal Hocko
  2019-09-03 12:17       ` Vinayak Menon
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2019-09-03 11:41 UTC (permalink / raw)
  To: Vinayak Menon; +Cc: minchan, linux-mm

On Tue 03-09-19 11:43:16, Vinayak Menon wrote:
> Hi Michal,
> 
> Thanks for reviewing this.
> 
> 
> On 9/2/2019 6:51 PM, Michal Hocko wrote:
> > On Fri 30-08-19 18:13:31, Vinayak Menon wrote:
> >> The following race is observed due to which a processes faulting
> >> on a swap entry, finds the page neither in swapcache nor swap. This
> >> causes zram to give a zero filled page that gets mapped to the
> >> process, resulting in a user space crash later.
> >>
> >> Consider parent and child processes Pa and Pb sharing the same swap
> >> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
> >> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
> >>
> >> Pa                                       Pb
> >>
> >> fault on VA                              fault on VA
> >> do_swap_page                             do_swap_page
> >> lookup_swap_cache fails                  lookup_swap_cache fails
> >>                                          Pb scheduled out
> >> swapin_readahead (deletes zram entry)
> >> swap_free (makes swap_count 1)
> >>                                          Pb scheduled in
> >>                                          swap_readpage (swap_count == 1)
> >>                                          Takes SWP_SYNCHRONOUS_IO path
> >>                                          zram enrty absent
> >>                                          zram gives a zero filled page
> > This sounds like a zram issue, right? Why is a generic swap path changed
> > then?
> 
> 
> I think zram entry being deleted by Pa and zram giving out a zeroed page to Pb is normal.

Isn't that a data loss? The race you mentioned shouldn't be possible
with the standard swap storage AFAIU. If that is really the case then
the zram needs a fix rather than a generic path. Or at least a very good
explanation why the generic path is a preferred way.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
  2019-09-03 11:41     ` Michal Hocko
@ 2019-09-03 12:17       ` Vinayak Menon
  2019-09-09  4:05         ` Vinayak Menon
  0 siblings, 1 reply; 16+ messages in thread
From: Vinayak Menon @ 2019-09-03 12:17 UTC (permalink / raw)
  To: Michal Hocko; +Cc: minchan, linux-mm


On 9/3/2019 5:11 PM, Michal Hocko wrote:
> On Tue 03-09-19 11:43:16, Vinayak Menon wrote:
>> Hi Michal,
>>
>> Thanks for reviewing this.
>>
>>
>> On 9/2/2019 6:51 PM, Michal Hocko wrote:
>>> On Fri 30-08-19 18:13:31, Vinayak Menon wrote:
>>>> The following race is observed due to which a processes faulting
>>>> on a swap entry, finds the page neither in swapcache nor swap. This
>>>> causes zram to give a zero filled page that gets mapped to the
>>>> process, resulting in a user space crash later.
>>>>
>>>> Consider parent and child processes Pa and Pb sharing the same swap
>>>> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
>>>> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
>>>>
>>>> Pa                                       Pb
>>>>
>>>> fault on VA                              fault on VA
>>>> do_swap_page                             do_swap_page
>>>> lookup_swap_cache fails                  lookup_swap_cache fails
>>>>                                          Pb scheduled out
>>>> swapin_readahead (deletes zram entry)
>>>> swap_free (makes swap_count 1)
>>>>                                          Pb scheduled in
>>>>                                          swap_readpage (swap_count == 1)
>>>>                                          Takes SWP_SYNCHRONOUS_IO path
>>>>                                          zram enrty absent
>>>>                                          zram gives a zero filled page
>>> This sounds like a zram issue, right? Why is a generic swap path changed
>>> then?
>>
>> I think zram entry being deleted by Pa and zram giving out a zeroed page to Pb is normal.
> Isn't that a data loss? The race you mentioned shouldn't be possible
> with the standard swap storage AFAIU. If that is really the case then
> the zram needs a fix rather than a generic path. Or at least a very good
> explanation why the generic path is a preferred way.


AFAIK, there isn't a data loss because, before deleting the entry, swap_slot_free_notify makes sure that

page is in swapcache and marks the page dirty to ensure a swap out before reclaim. I am referring to the

comment about this in swap_slot_free_notify. In the case of this race too, the page brought to swapcache

by Pa is still in swapcache. It is just that Pb failed to find it due to the race.

Yes, this race will not happen for standard swap storage and only for those block devices that set

disk->fops->swap_slot_free_notify and have SWP_SYNCHRONOUS_IO set (which seems to be only zram).

Now considering that zram works as expected, the fix is in generic path because the race is due to the bug in

SWP_SYNCHRONOUS_IO handling in do_swap_page. And it is only the SWP_SYNCHRONOUS_IO handling in

generic path which is modified.




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
  2019-09-03 12:17       ` Vinayak Menon
@ 2019-09-09  4:05         ` Vinayak Menon
  2019-09-09 11:23           ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: Vinayak Menon @ 2019-09-09  4:05 UTC (permalink / raw)
  To: Michal Hocko; +Cc: minchan, linux-mm


On 9/3/2019 5:47 PM, Vinayak Menon wrote:
> On 9/3/2019 5:11 PM, Michal Hocko wrote:
>> On Tue 03-09-19 11:43:16, Vinayak Menon wrote:
>>> Hi Michal,
>>>
>>> Thanks for reviewing this.
>>>
>>>
>>> On 9/2/2019 6:51 PM, Michal Hocko wrote:
>>>> On Fri 30-08-19 18:13:31, Vinayak Menon wrote:
>>>>> The following race is observed due to which a processes faulting
>>>>> on a swap entry, finds the page neither in swapcache nor swap. This
>>>>> causes zram to give a zero filled page that gets mapped to the
>>>>> process, resulting in a user space crash later.
>>>>>
>>>>> Consider parent and child processes Pa and Pb sharing the same swap
>>>>> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
>>>>> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
>>>>>
>>>>> Pa                                       Pb
>>>>>
>>>>> fault on VA                              fault on VA
>>>>> do_swap_page                             do_swap_page
>>>>> lookup_swap_cache fails                  lookup_swap_cache fails
>>>>>                                          Pb scheduled out
>>>>> swapin_readahead (deletes zram entry)
>>>>> swap_free (makes swap_count 1)
>>>>>                                          Pb scheduled in
>>>>>                                          swap_readpage (swap_count == 1)
>>>>>                                          Takes SWP_SYNCHRONOUS_IO path
>>>>>                                          zram enrty absent
>>>>>                                          zram gives a zero filled page
>>>> This sounds like a zram issue, right? Why is a generic swap path changed
>>>> then?
>>> I think zram entry being deleted by Pa and zram giving out a zeroed page to Pb is normal.
>> Isn't that a data loss? The race you mentioned shouldn't be possible
>> with the standard swap storage AFAIU. If that is really the case then
>> the zram needs a fix rather than a generic path. Or at least a very good
>> explanation why the generic path is a preferred way.
>
> AFAIK, there isn't a data loss because, before deleting the entry, swap_slot_free_notify makes sure that
>
> page is in swapcache and marks the page dirty to ensure a swap out before reclaim. I am referring to the
>
> comment about this in swap_slot_free_notify. In the case of this race too, the page brought to swapcache
>
> by Pa is still in swapcache. It is just that Pb failed to find it due to the race.
>
> Yes, this race will not happen for standard swap storage and only for those block devices that set
>
> disk->fops->swap_slot_free_notify and have SWP_SYNCHRONOUS_IO set (which seems to be only zram).
>
> Now considering that zram works as expected, the fix is in generic path because the race is due to the bug in
>
> SWP_SYNCHRONOUS_IO handling in do_swap_page. And it is only the SWP_SYNCHRONOUS_IO handling in
>
> generic path which is modified.
>

Hi Michal,

Do you see any concerns with the patch or explanation of the problem ?





^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
  2019-09-09  4:05         ` Vinayak Menon
@ 2019-09-09 11:23           ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2019-09-09 11:23 UTC (permalink / raw)
  To: Vinayak Menon; +Cc: minchan, linux-mm

On Mon 09-09-19 09:35:39, Vinayak Menon wrote:
> 
> On 9/3/2019 5:47 PM, Vinayak Menon wrote:
> > On 9/3/2019 5:11 PM, Michal Hocko wrote:
> >> On Tue 03-09-19 11:43:16, Vinayak Menon wrote:
> >>> Hi Michal,
> >>>
> >>> Thanks for reviewing this.
> >>>
> >>>
> >>> On 9/2/2019 6:51 PM, Michal Hocko wrote:
> >>>> On Fri 30-08-19 18:13:31, Vinayak Menon wrote:
> >>>>> The following race is observed due to which a processes faulting
> >>>>> on a swap entry, finds the page neither in swapcache nor swap. This
> >>>>> causes zram to give a zero filled page that gets mapped to the
> >>>>> process, resulting in a user space crash later.
> >>>>>
> >>>>> Consider parent and child processes Pa and Pb sharing the same swap
> >>>>> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
> >>>>> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
> >>>>>
> >>>>> Pa                                       Pb
> >>>>>
> >>>>> fault on VA                              fault on VA
> >>>>> do_swap_page                             do_swap_page
> >>>>> lookup_swap_cache fails                  lookup_swap_cache fails
> >>>>>                                          Pb scheduled out
> >>>>> swapin_readahead (deletes zram entry)
> >>>>> swap_free (makes swap_count 1)
> >>>>>                                          Pb scheduled in
> >>>>>                                          swap_readpage (swap_count == 1)
> >>>>>                                          Takes SWP_SYNCHRONOUS_IO path
> >>>>>                                          zram enrty absent
> >>>>>                                          zram gives a zero filled page
> >>>> This sounds like a zram issue, right? Why is a generic swap path changed
> >>>> then?
> >>> I think zram entry being deleted by Pa and zram giving out a zeroed page to Pb is normal.
> >> Isn't that a data loss? The race you mentioned shouldn't be possible
> >> with the standard swap storage AFAIU. If that is really the case then
> >> the zram needs a fix rather than a generic path. Or at least a very good
> >> explanation why the generic path is a preferred way.
> >
> > AFAIK, there isn't a data loss because, before deleting the entry, swap_slot_free_notify makes sure that
> >
> > page is in swapcache and marks the page dirty to ensure a swap out before reclaim. I am referring to the
> >
> > comment about this in swap_slot_free_notify. In the case of this race too, the page brought to swapcache
> >
> > by Pa is still in swapcache. It is just that Pb failed to find it due to the race.
> >
> > Yes, this race will not happen for standard swap storage and only for those block devices that set
> >
> > disk->fops->swap_slot_free_notify and have SWP_SYNCHRONOUS_IO set (which seems to be only zram).
> >
> > Now considering that zram works as expected, the fix is in generic path because the race is due to the bug in
> >
> > SWP_SYNCHRONOUS_IO handling in do_swap_page. And it is only the SWP_SYNCHRONOUS_IO handling in
> >
> > generic path which is modified.
> >
> 
> Hi Michal,
> 
> Do you see any concerns with the patch or explanation of the problem ?

I am sorry, I didn't have time to give this a more serious thought. You
need somebody more familiar with the code and time to look into it.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
  2019-08-30 12:43 [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path Vinayak Menon
  2019-09-02 13:21 ` Michal Hocko
@ 2019-09-09 23:26 ` Minchan Kim
  2019-09-10  8:22   ` Vinayak Menon
  1 sibling, 1 reply; 16+ messages in thread
From: Minchan Kim @ 2019-09-09 23:26 UTC (permalink / raw)
  To: Vinayak Menon; +Cc: linux-mm

Hi Vinayak,

On Fri, Aug 30, 2019 at 06:13:31PM +0530, Vinayak Menon wrote:
> The following race is observed due to which a processes faulting
> on a swap entry, finds the page neither in swapcache nor swap. This
> causes zram to give a zero filled page that gets mapped to the
> process, resulting in a user space crash later.
> 
> Consider parent and child processes Pa and Pb sharing the same swap
> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
> 
> Pa                                       Pb
> 
> fault on VA                              fault on VA
> do_swap_page                             do_swap_page
> lookup_swap_cache fails                  lookup_swap_cache fails
>                                          Pb scheduled out
> swapin_readahead (deletes zram entry)
> swap_free (makes swap_count 1)
>                                          Pb scheduled in
>                                          swap_readpage (swap_count == 1)
>                                          Takes SWP_SYNCHRONOUS_IO path
>                                          zram enrty absent
>                                          zram gives a zero filled page
> 
> Fix this by reading the swap_count before lookup_swap_cache, which conforms
> with the order in which page is added to swap cache and swap count is
> decremented in do_swap_page. In the race case above, this will let Pb take
> the readahead path and thus pick the proper page from swapcache.

Thanks for the report, Vinayak.

It's a zram specific issue because it deallocates zram block
unconditionally once read IO is done. The expectation was that dirty
page is on the swap cache but with SWP_SYNCHRONOUS_IO, it's not true
any more so I want to resolve the issue in zram specific code, not
general one.

A idea in my mind is swap_slot_free_notify should check the slot
reference counter and if it's higher than 1, it shouldn't free the
slot until. What do you think about?

> 
> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
> ---
>  mm/memory.c | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e0c232f..22643aa 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2744,6 +2744,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	struct page *page = NULL, *swapcache;
>  	struct mem_cgroup *memcg;
>  	swp_entry_t entry;
> +	struct swap_info_struct *si;
> +	bool skip_swapcache = false;
>  	pte_t pte;
>  	int locked;
>  	int exclusive = 0;
> @@ -2771,15 +2773,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  
>  
>  	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
> +
> +	/*
> +	 * lookup_swap_cache below can fail and before the SWP_SYNCHRONOUS_IO
> +	 * check is made, another process can populate the swapcache, delete
> +	 * the swap entry and decrement the swap count. So decide on taking
> +	 * the SWP_SYNCHRONOUS_IO path before the lookup. In the event of the
> +	 * race described, the victim process will find a swap_count > 1
> +	 * and can then take the readahead path instead of SWP_SYNCHRONOUS_IO.
> +	 */
> +	si = swp_swap_info(entry);
> +	if (si->flags & SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1)
> +		skip_swapcache = true;
> +
>  	page = lookup_swap_cache(entry, vma, vmf->address);
>  	swapcache = page;
>  
>  	if (!page) {
> -		struct swap_info_struct *si = swp_swap_info(entry);
> -
> -		if (si->flags & SWP_SYNCHRONOUS_IO &&
> -				__swap_count(entry) == 1) {
> -			/* skip swapcache */
> +		if (skip_swapcache) {
>  			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>  							vmf->address);
>  			if (page) {
> -- 
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> member of the Code Aurora Forum, hosted by The Linux Foundation
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
  2019-09-09 23:26 ` Minchan Kim
@ 2019-09-10  8:22   ` Vinayak Menon
  2019-09-10 17:51     ` Minchan Kim
  0 siblings, 1 reply; 16+ messages in thread
From: Vinayak Menon @ 2019-09-10  8:22 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm

Hi Minchan,


On 9/10/2019 4:56 AM, Minchan Kim wrote:
> Hi Vinayak,
>
> On Fri, Aug 30, 2019 at 06:13:31PM +0530, Vinayak Menon wrote:
>> The following race is observed due to which a processes faulting
>> on a swap entry, finds the page neither in swapcache nor swap. This
>> causes zram to give a zero filled page that gets mapped to the
>> process, resulting in a user space crash later.
>>
>> Consider parent and child processes Pa and Pb sharing the same swap
>> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
>> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
>>
>> Pa                                       Pb
>>
>> fault on VA                              fault on VA
>> do_swap_page                             do_swap_page
>> lookup_swap_cache fails                  lookup_swap_cache fails
>>                                          Pb scheduled out
>> swapin_readahead (deletes zram entry)
>> swap_free (makes swap_count 1)
>>                                          Pb scheduled in
>>                                          swap_readpage (swap_count == 1)
>>                                          Takes SWP_SYNCHRONOUS_IO path
>>                                          zram enrty absent
>>                                          zram gives a zero filled page
>>
>> Fix this by reading the swap_count before lookup_swap_cache, which conforms
>> with the order in which page is added to swap cache and swap count is
>> decremented in do_swap_page. In the race case above, this will let Pb take
>> the readahead path and thus pick the proper page from swapcache.
> Thanks for the report, Vinayak.
>
> It's a zram specific issue because it deallocates zram block
> unconditionally once read IO is done. The expectation was that dirty
> page is on the swap cache but with SWP_SYNCHRONOUS_IO, it's not true
> any more so I want to resolve the issue in zram specific code, not
> general one.


Thanks for comments Minchan.

Trying to understand your comment better.  With SWP_SYNCHRONOUS_IO also, swap_slot_free_notify will

make sure that it deletes the entry only if the page is in swapcache. Even in the current issue case, a valid

entry is present in the swapcache at the time of issue (brought in by Pa). Its just that Pb missed it due to the

race and tried to read again from zram. So thinking whether it is an issue with zram deleting the entry, or

SWP_SYNCHRONOUS_IO failing to find the valid swapcache entry. There isn't actually a case seen where zram

entry is deleted unconditionally, with some process yet to reference the slot and page is not in swapcache.


>
> A idea in my mind is swap_slot_free_notify should check the slot
> reference counter and if it's higher than 1, it shouldn't free the
> slot until. What do you think about?

It seems fine to me except for the fact that it will delay zram entry deletion for shared slots, which

can be significant sometimes. Also, should we fix this path as the issue is with SWP_SYNCHRONOUS_IO missing

a valid swapcache entry ?

Can swapcache check be done like below, before taking the SWP_SYNCHRONOUS_IO path, as an alternative ?


diff --git a/include/linux/swap.h b/include/linux/swap.h
index 063c0c1..a5ca05f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -463,6 +463,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
 extern int __swap_count(swp_entry_t entry);
+extern bool __swap_has_cache(swp_entry_t entry);
 extern int __swp_swapcount(swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
@@ -589,6 +590,11 @@ static inline int __swap_count(swp_entry_t entry)
        return 0;
 }

+static bool __swap_has_cache(swp_entry_t entry)
+{
+       return 0;
+}
+
 static inline int __swp_swapcount(swp_entry_t entry)
 {
        return 0;

diff --git a/mm/memory.c b/mm/memory.c
index e0c232f..a13511f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2778,7 +2778,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
                struct swap_info_struct *si = swp_swap_info(entry);

                if (si->flags & SWP_SYNCHRONOUS_IO &&
-                               __swap_count(entry) == 1) {
+                               __swap_count(entry) == 1 &&
+                               !__swap_has_cache(entry)) {
                        /* skip swapcache */
                        page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
                                                        vmf->address);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 80445f4..2a1554a8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1459,6 +1459,20 @@ int __swap_count(swp_entry_t entry)
        return count;
 }

+bool __swap_has_cache(swp_entry_t entry)
+{
+       struct swap_info_struct *si;
+       pgoff_t offset = swp_offset(entry);
+       bool has_cache  = false;
+
+       si = get_swap_device(entry);
+       if (si) {
+               has_cache = !!(si->swap_map[offset] & SWAP_HAS_CACHE);
+               put_swap_device(si);
+       }
+       return has_cache;
+}
+
 static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 {
        int count = 0;


>
>> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
>> ---
>>  mm/memory.c | 21 ++++++++++++++++-----
>>  1 file changed, 16 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index e0c232f..22643aa 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2744,6 +2744,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  	struct page *page = NULL, *swapcache;
>>  	struct mem_cgroup *memcg;
>>  	swp_entry_t entry;
>> +	struct swap_info_struct *si;
>> +	bool skip_swapcache = false;
>>  	pte_t pte;
>>  	int locked;
>>  	int exclusive = 0;
>> @@ -2771,15 +2773,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  
>>  
>>  	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
>> +
>> +	/*
>> +	 * lookup_swap_cache below can fail and before the SWP_SYNCHRONOUS_IO
>> +	 * check is made, another process can populate the swapcache, delete
>> +	 * the swap entry and decrement the swap count. So decide on taking
>> +	 * the SWP_SYNCHRONOUS_IO path before the lookup. In the event of the
>> +	 * race described, the victim process will find a swap_count > 1
>> +	 * and can then take the readahead path instead of SWP_SYNCHRONOUS_IO.
>> +	 */
>> +	si = swp_swap_info(entry);
>> +	if (si->flags & SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1)
>> +		skip_swapcache = true;
>> +
>>  	page = lookup_swap_cache(entry, vma, vmf->address);
>>  	swapcache = page;
>>  
>>  	if (!page) {
>> -		struct swap_info_struct *si = swp_swap_info(entry);
>> -
>> -		if (si->flags & SWP_SYNCHRONOUS_IO &&
>> -				__swap_count(entry) == 1) {
>> -			/* skip swapcache */
>> +		if (skip_swapcache) {
>>  			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>>  							vmf->address);
>>  			if (page) {
>> -- 
>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
>> member of the Code Aurora Forum, hosted by The Linux Foundation
>>


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
  2019-09-10  8:22   ` Vinayak Menon
@ 2019-09-10 17:51     ` Minchan Kim
  2019-09-11 10:07       ` Vinayak Menon
  0 siblings, 1 reply; 16+ messages in thread
From: Minchan Kim @ 2019-09-10 17:51 UTC (permalink / raw)
  To: Vinayak Menon; +Cc: linux-mm

On Tue, Sep 10, 2019 at 01:52:36PM +0530, Vinayak Menon wrote:
> Hi Minchan,
> 
> 
> On 9/10/2019 4:56 AM, Minchan Kim wrote:
> > Hi Vinayak,
> >
> > On Fri, Aug 30, 2019 at 06:13:31PM +0530, Vinayak Menon wrote:
> >> The following race is observed due to which a processes faulting
> >> on a swap entry, finds the page neither in swapcache nor swap. This
> >> causes zram to give a zero filled page that gets mapped to the
> >> process, resulting in a user space crash later.
> >>
> >> Consider parent and child processes Pa and Pb sharing the same swap
> >> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
> >> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
> >>
> >> Pa                                       Pb
> >>
> >> fault on VA                              fault on VA
> >> do_swap_page                             do_swap_page
> >> lookup_swap_cache fails                  lookup_swap_cache fails
> >>                                          Pb scheduled out
> >> swapin_readahead (deletes zram entry)
> >> swap_free (makes swap_count 1)
> >>                                          Pb scheduled in
> >>                                          swap_readpage (swap_count == 1)
> >>                                          Takes SWP_SYNCHRONOUS_IO path
> >>                                          zram enrty absent
> >>                                          zram gives a zero filled page
> >>
> >> Fix this by reading the swap_count before lookup_swap_cache, which conforms
> >> with the order in which page is added to swap cache and swap count is
> >> decremented in do_swap_page. In the race case above, this will let Pb take
> >> the readahead path and thus pick the proper page from swapcache.
> > Thanks for the report, Vinayak.
> >
> > It's a zram specific issue because it deallocates zram block
> > unconditionally once read IO is done. The expectation was that dirty
> > page is on the swap cache but with SWP_SYNCHRONOUS_IO, it's not true
> > any more so I want to resolve the issue in zram specific code, not
> > general one.
> 
> 
> Thanks for comments Minchan.
> 
> Trying to understand your comment better.  With SWP_SYNCHRONOUS_IO also, swap_slot_free_notify will
> 
> make sure that it deletes the entry only if the page is in swapcache. Even in the current issue case, a valid
> 
> entry is present in the swapcache at the time of issue (brought in by Pa). Its just that Pb missed it due to the
> 
> race and tried to read again from zram. So thinking whether it is an issue with zram deleting the entry, or
> 
> SWP_SYNCHRONOUS_IO failing to find the valid swapcache entry. There isn't actually a case seen where zram
> 
> entry is deleted unconditionally, with some process yet to reference the slot and page is not in swapcache.
> 
> 
> >
> > A idea in my mind is swap_slot_free_notify should check the slot
> > reference counter and if it's higher than 1, it shouldn't free the
> > slot until. What do you think about?
> 
> It seems fine to me except for the fact that it will delay zram entry deletion for shared slots, which
> 
> can be significant sometimes. Also, should we fix this path as the issue is with SWP_SYNCHRONOUS_IO missing

It's always trade-off between memory vs performance since it could hit
in swap cache. If it's shared page, it's likely to hit a cache next time
so we could get performance benefit.

Actually, swap_slot_free_notify is layering violation so I wanted to
replace it with discard hint in the long run so want to go the direction.

> 
> a valid swapcache entry ?
> 
> Can swapcache check be done like below, before taking the SWP_SYNCHRONOUS_IO path, as an alternative ?

With your approach, what prevent below scenario?

A                                                       B

                                            do_swap_page
                                            SWP_SYNCHRONOUS_IO && __swap_count == 1
shrink_page_list
add_to_swap
    swap_count = 2

..
..
do_swap_page
swap_read
    swap_slot_free_notify
        zram's slot will be removed
                                            page = alloc_page_vma
                                            swap_readpage <-- read zero


> 
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 063c0c1..a5ca05f 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -463,6 +463,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **);
>  extern sector_t swapdev_block(int, pgoff_t);
>  extern int page_swapcount(struct page *);
>  extern int __swap_count(swp_entry_t entry);
> +extern bool __swap_has_cache(swp_entry_t entry);
>  extern int __swp_swapcount(swp_entry_t entry);
>  extern int swp_swapcount(swp_entry_t entry);
>  extern struct swap_info_struct *page_swap_info(struct page *);
> @@ -589,6 +590,11 @@ static inline int __swap_count(swp_entry_t entry)
>         return 0;
>  }
> 
> +static bool __swap_has_cache(swp_entry_t entry)
> +{
> +       return 0;
> +}
> +
>  static inline int __swp_swapcount(swp_entry_t entry)
>  {
>         return 0;
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e0c232f..a13511f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2778,7 +2778,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                 struct swap_info_struct *si = swp_swap_info(entry);
> 
>                 if (si->flags & SWP_SYNCHRONOUS_IO &&
> -                               __swap_count(entry) == 1) {
> +                               __swap_count(entry) == 1 &&
> +                               !__swap_has_cache(entry)) {
>                         /* skip swapcache */
>                         page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>                                                         vmf->address);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 80445f4..2a1554a8 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1459,6 +1459,20 @@ int __swap_count(swp_entry_t entry)
>         return count;
>  }
> 
> +bool __swap_has_cache(swp_entry_t entry)
> +{
> +       struct swap_info_struct *si;
> +       pgoff_t offset = swp_offset(entry);
> +       bool has_cache  = false;
> +
> +       si = get_swap_device(entry);
> +       if (si) {
> +               has_cache = !!(si->swap_map[offset] & SWAP_HAS_CACHE);
> +               put_swap_device(si);
> +       }
> +       return has_cache;
> +}
> +
>  static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
>  {
>         int count = 0;
> 
> 
> >
> >> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
> >> ---
> >>  mm/memory.c | 21 ++++++++++++++++-----
> >>  1 file changed, 16 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index e0c232f..22643aa 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -2744,6 +2744,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>  	struct page *page = NULL, *swapcache;
> >>  	struct mem_cgroup *memcg;
> >>  	swp_entry_t entry;
> >> +	struct swap_info_struct *si;
> >> +	bool skip_swapcache = false;
> >>  	pte_t pte;
> >>  	int locked;
> >>  	int exclusive = 0;
> >> @@ -2771,15 +2773,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>  
> >>  
> >>  	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
> >> +
> >> +	/*
> >> +	 * lookup_swap_cache below can fail and before the SWP_SYNCHRONOUS_IO
> >> +	 * check is made, another process can populate the swapcache, delete
> >> +	 * the swap entry and decrement the swap count. So decide on taking
> >> +	 * the SWP_SYNCHRONOUS_IO path before the lookup. In the event of the
> >> +	 * race described, the victim process will find a swap_count > 1
> >> +	 * and can then take the readahead path instead of SWP_SYNCHRONOUS_IO.
> >> +	 */
> >> +	si = swp_swap_info(entry);
> >> +	if (si->flags & SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1)
> >> +		skip_swapcache = true;
> >> +
> >>  	page = lookup_swap_cache(entry, vma, vmf->address);
> >>  	swapcache = page;
> >>  
> >>  	if (!page) {
> >> -		struct swap_info_struct *si = swp_swap_info(entry);
> >> -
> >> -		if (si->flags & SWP_SYNCHRONOUS_IO &&
> >> -				__swap_count(entry) == 1) {
> >> -			/* skip swapcache */
> >> +		if (skip_swapcache) {
> >>  			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
> >>  							vmf->address);
> >>  			if (page) {
> >> -- 
> >> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> >> member of the Code Aurora Forum, hosted by The Linux Foundation
> >>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
  2019-09-10 17:51     ` Minchan Kim
@ 2019-09-11 10:07       ` Vinayak Menon
  2019-09-12 17:14         ` Minchan Kim
  0 siblings, 1 reply; 16+ messages in thread
From: Vinayak Menon @ 2019-09-11 10:07 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm


On 9/10/2019 11:21 PM, Minchan Kim wrote:
> On Tue, Sep 10, 2019 at 01:52:36PM +0530, Vinayak Menon wrote:
>> Hi Minchan,
>>
>>
>> On 9/10/2019 4:56 AM, Minchan Kim wrote:
>>> Hi Vinayak,
>>>
>>> On Fri, Aug 30, 2019 at 06:13:31PM +0530, Vinayak Menon wrote:
>>>> The following race is observed due to which a processes faulting
>>>> on a swap entry, finds the page neither in swapcache nor swap. This
>>>> causes zram to give a zero filled page that gets mapped to the
>>>> process, resulting in a user space crash later.
>>>>
>>>> Consider parent and child processes Pa and Pb sharing the same swap
>>>> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
>>>> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
>>>>
>>>> Pa                                       Pb
>>>>
>>>> fault on VA                              fault on VA
>>>> do_swap_page                             do_swap_page
>>>> lookup_swap_cache fails                  lookup_swap_cache fails
>>>>                                          Pb scheduled out
>>>> swapin_readahead (deletes zram entry)
>>>> swap_free (makes swap_count 1)
>>>>                                          Pb scheduled in
>>>>                                          swap_readpage (swap_count == 1)
>>>>                                          Takes SWP_SYNCHRONOUS_IO path
>>>>                                          zram enrty absent
>>>>                                          zram gives a zero filled page
>>>>
>>>> Fix this by reading the swap_count before lookup_swap_cache, which conforms
>>>> with the order in which page is added to swap cache and swap count is
>>>> decremented in do_swap_page. In the race case above, this will let Pb take
>>>> the readahead path and thus pick the proper page from swapcache.
>>> Thanks for the report, Vinayak.
>>>
>>> It's a zram specific issue because it deallocates zram block
>>> unconditionally once read IO is done. The expectation was that dirty
>>> page is on the swap cache but with SWP_SYNCHRONOUS_IO, it's not true
>>> any more so I want to resolve the issue in zram specific code, not
>>> general one.
>>
>> Thanks for comments Minchan.
>>
>> Trying to understand your comment better.  With SWP_SYNCHRONOUS_IO also, swap_slot_free_notify will
>>
>> make sure that it deletes the entry only if the page is in swapcache. Even in the current issue case, a valid
>>
>> entry is present in the swapcache at the time of issue (brought in by Pa). Its just that Pb missed it due to the
>>
>> race and tried to read again from zram. So thinking whether it is an issue with zram deleting the entry, or
>>
>> SWP_SYNCHRONOUS_IO failing to find the valid swapcache entry. There isn't actually a case seen where zram
>>
>> entry is deleted unconditionally, with some process yet to reference the slot and page is not in swapcache.
>>
>>
>>> A idea in my mind is swap_slot_free_notify should check the slot
>>> reference counter and if it's higher than 1, it shouldn't free the
>>> slot until. What do you think about?
>> It seems fine to me except for the fact that it will delay zram entry deletion for shared slots, which
>>
>> can be significant sometimes. Also, should we fix this path as the issue is with SWP_SYNCHRONOUS_IO missing
> It's always trade-off between memory vs performance since it could hit
> in swap cache. If it's shared page, it's likely to hit a cache next time
> so we could get performance benefit.
>
> Actually, swap_slot_free_notify is layering violation so I wanted to
> replace it with discard hint in the long run so want to go the direction.


Okay got it.


>
>> a valid swapcache entry ?
>>
>> Can swapcache check be done like below, before taking the SWP_SYNCHRONOUS_IO path, as an alternative ?
> With your approach, what prevent below scenario?
>
> A                                                       B
>
>                                             do_swap_page
>                                             SWP_SYNCHRONOUS_IO && __swap_count == 1


As shrink_page_list is picking the page from LRU and B is trying to read from swap simultaneously, I assume someone had read

the page from swap prior to B, when its swap_count was say 2 (for it to be reclaimed by shrink_page_list now)

If so, that read itself would have deleted the zram entry ? And the read page will be in swapcache and dirty ? In that case, with SWAP_HAS_CACHE

check in the patch, B will take readahead path. And shrink_page_list would attempt a pageout to zram, for the dirty page ?


> shrink_page_list
> add_to_swap
>     swap_count = 2
>
> ..
> ..
> do_swap_page
> swap_read
>     swap_slot_free_notify
>         zram's slot will be removed
>                                             page = alloc_page_vma
>                                             swap_readpage <-- read zero
>
>
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 063c0c1..a5ca05f 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -463,6 +463,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **);
>>  extern sector_t swapdev_block(int, pgoff_t);
>>  extern int page_swapcount(struct page *);
>>  extern int __swap_count(swp_entry_t entry);
>> +extern bool __swap_has_cache(swp_entry_t entry);
>>  extern int __swp_swapcount(swp_entry_t entry);
>>  extern int swp_swapcount(swp_entry_t entry);
>>  extern struct swap_info_struct *page_swap_info(struct page *);
>> @@ -589,6 +590,11 @@ static inline int __swap_count(swp_entry_t entry)
>>         return 0;
>>  }
>>
>> +static bool __swap_has_cache(swp_entry_t entry)
>> +{
>> +       return 0;
>> +}
>> +
>>  static inline int __swp_swapcount(swp_entry_t entry)
>>  {
>>         return 0;
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index e0c232f..a13511f 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2778,7 +2778,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>                 struct swap_info_struct *si = swp_swap_info(entry);
>>
>>                 if (si->flags & SWP_SYNCHRONOUS_IO &&
>> -                               __swap_count(entry) == 1) {
>> +                               __swap_count(entry) == 1 &&
>> +                               !__swap_has_cache(entry)) {
>>                         /* skip swapcache */
>>                         page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>>                                                         vmf->address);
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 80445f4..2a1554a8 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -1459,6 +1459,20 @@ int __swap_count(swp_entry_t entry)
>>         return count;
>>  }
>>
>> +bool __swap_has_cache(swp_entry_t entry)
>> +{
>> +       struct swap_info_struct *si;
>> +       pgoff_t offset = swp_offset(entry);
>> +       bool has_cache  = false;
>> +
>> +       si = get_swap_device(entry);
>> +       if (si) {
>> +               has_cache = !!(si->swap_map[offset] & SWAP_HAS_CACHE);
>> +               put_swap_device(si);
>> +       }
>> +       return has_cache;
>> +}
>> +
>>  static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
>>  {
>>         int count = 0;
>>
>>
>>>> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
>>>> ---
>>>>  mm/memory.c | 21 ++++++++++++++++-----
>>>>  1 file changed, 16 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index e0c232f..22643aa 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -2744,6 +2744,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>  	struct page *page = NULL, *swapcache;
>>>>  	struct mem_cgroup *memcg;
>>>>  	swp_entry_t entry;
>>>> +	struct swap_info_struct *si;
>>>> +	bool skip_swapcache = false;
>>>>  	pte_t pte;
>>>>  	int locked;
>>>>  	int exclusive = 0;
>>>> @@ -2771,15 +2773,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>>>  
>>>>  
>>>>  	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
>>>> +
>>>> +	/*
>>>> +	 * lookup_swap_cache below can fail and before the SWP_SYNCHRONOUS_IO
>>>> +	 * check is made, another process can populate the swapcache, delete
>>>> +	 * the swap entry and decrement the swap count. So decide on taking
>>>> +	 * the SWP_SYNCHRONOUS_IO path before the lookup. In the event of the
>>>> +	 * race described, the victim process will find a swap_count > 1
>>>> +	 * and can then take the readahead path instead of SWP_SYNCHRONOUS_IO.
>>>> +	 */
>>>> +	si = swp_swap_info(entry);
>>>> +	if (si->flags & SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1)
>>>> +		skip_swapcache = true;
>>>> +
>>>>  	page = lookup_swap_cache(entry, vma, vmf->address);
>>>>  	swapcache = page;
>>>>  
>>>>  	if (!page) {
>>>> -		struct swap_info_struct *si = swp_swap_info(entry);
>>>> -
>>>> -		if (si->flags & SWP_SYNCHRONOUS_IO &&
>>>> -				__swap_count(entry) == 1) {
>>>> -			/* skip swapcache */
>>>> +		if (skip_swapcache) {
>>>>  			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
>>>>  							vmf->address);
>>>>  			if (page) {
>>>> -- 
>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
>>>> member of the Code Aurora Forum, hosted by The Linux Foundation
>>>>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
  2019-09-11 10:07       ` Vinayak Menon
@ 2019-09-12 17:14         ` Minchan Kim
  2019-09-13  9:05           ` Vinayak Menon
  0 siblings, 1 reply; 16+ messages in thread
From: Minchan Kim @ 2019-09-12 17:14 UTC (permalink / raw)
  To: Vinayak Menon; +Cc: linux-mm

Hi Vinayak,

On Wed, Sep 11, 2019 at 03:37:23PM +0530, Vinayak Menon wrote:

< snip >

> >> Can swapcache check be done like below, before taking the SWP_SYNCHRONOUS_IO path, as an alternative ?
> > With your approach, what prevent below scenario?
> >
> > A                                                       B
> >
> >                                             do_swap_page
> >                                             SWP_SYNCHRONOUS_IO && __swap_count == 1
> 
> 
> As shrink_page_list is picking the page from LRU and B is trying to read from swap simultaneously, I assume someone had read
> 
> the page from swap prior to B, when its swap_count was say 2 (for it to be reclaimed by shrink_page_list now)

It could happen after B saw __swap_count == 1. Think about forking new process.
In that case, swap_count is 2 and the forked process will access the page(it
ends up freeing zram slot but the page would be swap cache. However, B process
doesn't know it).


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
  2019-09-12 17:14         ` Minchan Kim
@ 2019-09-13  9:05           ` Vinayak Menon
  2019-09-16 20:05             ` Minchan Kim
  0 siblings, 1 reply; 16+ messages in thread
From: Vinayak Menon @ 2019-09-13  9:05 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm


On 9/12/2019 10:44 PM, Minchan Kim wrote:
> Hi Vinayak,
>
> On Wed, Sep 11, 2019 at 03:37:23PM +0530, Vinayak Menon wrote:
>
> < snip >
>
>>>> Can swapcache check be done like below, before taking the SWP_SYNCHRONOUS_IO path, as an alternative ?
>>> With your approach, what prevent below scenario?
>>>
>>> A                                                       B
>>>
>>>                                             do_swap_page
>>>                                             SWP_SYNCHRONOUS_IO && __swap_count == 1
>>
>> As shrink_page_list is picking the page from LRU and B is trying to read from swap simultaneously, I assume someone had read
>>
>> the page from swap prior to B, when its swap_count was say 2 (for it to be reclaimed by shrink_page_list now)
> It could happen after B saw __swap_count == 1. Think about forking new process.
> In that case, swap_count is 2 and the forked process will access the page(it
> ends up freeing zram slot but the page would be swap cache. However, B process
> doesn't know it).


Okay, so when B has read __swap_count == 1, it means that it has taken down_read on mmap_sem in fault path

already. This means fork will not be able to proceed which needs to have down_write on parent's mmap_sem ?




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
  2019-09-13  9:05           ` Vinayak Menon
@ 2019-09-16 20:05             ` Minchan Kim
  2019-09-17  5:38               ` Vinayak Menon
  0 siblings, 1 reply; 16+ messages in thread
From: Minchan Kim @ 2019-09-16 20:05 UTC (permalink / raw)
  To: Vinayak Menon; +Cc: linux-mm

Hi Vinayak,

On Fri, Sep 13, 2019 at 02:35:41PM +0530, Vinayak Menon wrote:
> 
> On 9/12/2019 10:44 PM, Minchan Kim wrote:
> > Hi Vinayak,
> >
> > On Wed, Sep 11, 2019 at 03:37:23PM +0530, Vinayak Menon wrote:
> >
> > < snip >
> >
> >>>> Can swapcache check be done like below, before taking the SWP_SYNCHRONOUS_IO path, as an alternative ?
> >>> With your approach, what prevent below scenario?
> >>>
> >>> A                                                       B
> >>>
> >>>                                             do_swap_page
> >>>                                             SWP_SYNCHRONOUS_IO && __swap_count == 1
> >>
> >> As shrink_page_list is picking the page from LRU and B is trying to read from swap simultaneously, I assume someone had read
> >>
> >> the page from swap prior to B, when its swap_count was say 2 (for it to be reclaimed by shrink_page_list now)
> > It could happen after B saw __swap_count == 1. Think about forking new process.
> > In that case, swap_count is 2 and the forked process will access the page(it
> > ends up freeing zram slot but the page would be swap cache. However, B process
> > doesn't know it).
> 
> 
> Okay, so when B has read __swap_count == 1, it means that it has taken down_read on mmap_sem in fault path
> 
> already. This means fork will not be able to proceed which needs to have down_write on parent's mmap_sem ?
> 

You are exactly right. However, I still believe better option to solve
the issue is to check swap_count and delte only if swap_count == 1
in swap_slot_free_notify because it's zram specific issue and more safe
without depending other lock scheme.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
  2019-09-16 20:05             ` Minchan Kim
@ 2019-09-17  5:38               ` Vinayak Menon
  2019-09-18  1:12                 ` Minchan Kim
  0 siblings, 1 reply; 16+ messages in thread
From: Vinayak Menon @ 2019-09-17  5:38 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm


On 9/17/2019 1:35 AM, Minchan Kim wrote:
> Hi Vinayak,
>
> On Fri, Sep 13, 2019 at 02:35:41PM +0530, Vinayak Menon wrote:
>> On 9/12/2019 10:44 PM, Minchan Kim wrote:
>>> Hi Vinayak,
>>>
>>> On Wed, Sep 11, 2019 at 03:37:23PM +0530, Vinayak Menon wrote:
>>>
>>> < snip >
>>>
>>>>>> Can swapcache check be done like below, before taking the SWP_SYNCHRONOUS_IO path, as an alternative ?
>>>>> With your approach, what prevent below scenario?
>>>>>
>>>>> A                                                       B
>>>>>
>>>>>                                             do_swap_page
>>>>>                                             SWP_SYNCHRONOUS_IO && __swap_count == 1
>>>> As shrink_page_list is picking the page from LRU and B is trying to read from swap simultaneously, I assume someone had read
>>>>
>>>> the page from swap prior to B, when its swap_count was say 2 (for it to be reclaimed by shrink_page_list now)
>>> It could happen after B saw __swap_count == 1. Think about forking new process.
>>> In that case, swap_count is 2 and the forked process will access the page(it
>>> ends up freeing zram slot but the page would be swap cache. However, B process
>>> doesn't know it).
>>
>> Okay, so when B has read __swap_count == 1, it means that it has taken down_read on mmap_sem in fault path
>>
>> already. This means fork will not be able to proceed which needs to have down_write on parent's mmap_sem ?
>>
> You are exactly right. However, I still believe better option to solve
> the issue is to check swap_count and delte only if swap_count == 1
> in swap_slot_free_notify because it's zram specific issue and more safe
> without depending other lock scheme.


Sure. Let me know if you want me to post a patch for that.




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
  2019-09-17  5:38               ` Vinayak Menon
@ 2019-09-18  1:12                 ` Minchan Kim
  0 siblings, 0 replies; 16+ messages in thread
From: Minchan Kim @ 2019-09-18  1:12 UTC (permalink / raw)
  To: Vinayak Menon; +Cc: linux-mm

On Tue, Sep 17, 2019 at 11:08:49AM +0530, Vinayak Menon wrote:
> 
> On 9/17/2019 1:35 AM, Minchan Kim wrote:
> > Hi Vinayak,
> >
> > On Fri, Sep 13, 2019 at 02:35:41PM +0530, Vinayak Menon wrote:
> >> On 9/12/2019 10:44 PM, Minchan Kim wrote:
> >>> Hi Vinayak,
> >>>
> >>> On Wed, Sep 11, 2019 at 03:37:23PM +0530, Vinayak Menon wrote:
> >>>
> >>> < snip >
> >>>
> >>>>>> Can swapcache check be done like below, before taking the SWP_SYNCHRONOUS_IO path, as an alternative ?
> >>>>> With your approach, what prevent below scenario?
> >>>>>
> >>>>> A                                                       B
> >>>>>
> >>>>>                                             do_swap_page
> >>>>>                                             SWP_SYNCHRONOUS_IO && __swap_count == 1
> >>>> As shrink_page_list is picking the page from LRU and B is trying to read from swap simultaneously, I assume someone had read
> >>>>
> >>>> the page from swap prior to B, when its swap_count was say 2 (for it to be reclaimed by shrink_page_list now)
> >>> It could happen after B saw __swap_count == 1. Think about forking new process.
> >>> In that case, swap_count is 2 and the forked process will access the page(it
> >>> ends up freeing zram slot but the page would be swap cache. However, B process
> >>> doesn't know it).
> >>
> >> Okay, so when B has read __swap_count == 1, it means that it has taken down_read on mmap_sem in fault path
> >>
> >> already. This means fork will not be able to proceed which needs to have down_write on parent's mmap_sem ?
> >>
> > You are exactly right. However, I still believe better option to solve
> > the issue is to check swap_count and delte only if swap_count == 1
> > in swap_slot_free_notify because it's zram specific issue and more safe
> > without depending other lock scheme.
> 
> 
> Sure. Let me know if you want me to post a patch for that.
> 

Please post a patch.
Thanks, Vinayak!


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2019-09-18  1:12 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-30 12:43 [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path Vinayak Menon
2019-09-02 13:21 ` Michal Hocko
2019-09-03  6:13   ` Vinayak Menon
2019-09-03 11:41     ` Michal Hocko
2019-09-03 12:17       ` Vinayak Menon
2019-09-09  4:05         ` Vinayak Menon
2019-09-09 11:23           ` Michal Hocko
2019-09-09 23:26 ` Minchan Kim
2019-09-10  8:22   ` Vinayak Menon
2019-09-10 17:51     ` Minchan Kim
2019-09-11 10:07       ` Vinayak Menon
2019-09-12 17:14         ` Minchan Kim
2019-09-13  9:05           ` Vinayak Menon
2019-09-16 20:05             ` Minchan Kim
2019-09-17  5:38               ` Vinayak Menon
2019-09-18  1:12                 ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).