From: Minchan Kim <minchan@kernel.org>
To: Vinayak Menon <vinmenon@codeaurora.org>
Cc: linux-mm@kvack.org
Subject: Re: [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path
Date: Tue, 10 Sep 2019 10:51:16 -0700 [thread overview]
Message-ID: <20190910175116.GB39783@google.com> (raw)
In-Reply-To: <9df3bb51-2094-c849-8171-dce6784e1e70@codeaurora.org>
On Tue, Sep 10, 2019 at 01:52:36PM +0530, Vinayak Menon wrote:
> Hi Minchan,
>
>
> On 9/10/2019 4:56 AM, Minchan Kim wrote:
> > Hi Vinayak,
> >
> > On Fri, Aug 30, 2019 at 06:13:31PM +0530, Vinayak Menon wrote:
> >> The following race is observed due to which a processes faulting
> >> on a swap entry, finds the page neither in swapcache nor swap. This
> >> causes zram to give a zero filled page that gets mapped to the
> >> process, resulting in a user space crash later.
> >>
> >> Consider parent and child processes Pa and Pb sharing the same swap
> >> slot with swap_count 2. Swap is on zram with SWP_SYNCHRONOUS_IO set.
> >> Virtual address 'VA' of Pa and Pb points to the shared swap entry.
> >>
> >> Pa Pb
> >>
> >> fault on VA fault on VA
> >> do_swap_page do_swap_page
> >> lookup_swap_cache fails lookup_swap_cache fails
> >> Pb scheduled out
> >> swapin_readahead (deletes zram entry)
> >> swap_free (makes swap_count 1)
> >> Pb scheduled in
> >> swap_readpage (swap_count == 1)
> >> Takes SWP_SYNCHRONOUS_IO path
> >> zram enrty absent
> >> zram gives a zero filled page
> >>
> >> Fix this by reading the swap_count before lookup_swap_cache, which conforms
> >> with the order in which page is added to swap cache and swap count is
> >> decremented in do_swap_page. In the race case above, this will let Pb take
> >> the readahead path and thus pick the proper page from swapcache.
> > Thanks for the report, Vinayak.
> >
> > It's a zram specific issue because it deallocates zram block
> > unconditionally once read IO is done. The expectation was that dirty
> > page is on the swap cache but with SWP_SYNCHRONOUS_IO, it's not true
> > any more so I want to resolve the issue in zram specific code, not
> > general one.
>
>
> Thanks for comments Minchan.
>
> Trying to understand your comment better. With SWP_SYNCHRONOUS_IO also, swap_slot_free_notify will
>
> make sure that it deletes the entry only if the page is in swapcache. Even in the current issue case, a valid
>
> entry is present in the swapcache at the time of issue (brought in by Pa). Its just that Pb missed it due to the
>
> race and tried to read again from zram. So thinking whether it is an issue with zram deleting the entry, or
>
> SWP_SYNCHRONOUS_IO failing to find the valid swapcache entry. There isn't actually a case seen where zram
>
> entry is deleted unconditionally, with some process yet to reference the slot and page is not in swapcache.
>
>
> >
> > A idea in my mind is swap_slot_free_notify should check the slot
> > reference counter and if it's higher than 1, it shouldn't free the
> > slot until. What do you think about?
>
> It seems fine to me except for the fact that it will delay zram entry deletion for shared slots, which
>
> can be significant sometimes. Also, should we fix this path as the issue is with SWP_SYNCHRONOUS_IO missing
It's always trade-off between memory vs performance since it could hit
in swap cache. If it's shared page, it's likely to hit a cache next time
so we could get performance benefit.
Actually, swap_slot_free_notify is layering violation so I wanted to
replace it with discard hint in the long run so want to go the direction.
>
> a valid swapcache entry ?
>
> Can swapcache check be done like below, before taking the SWP_SYNCHRONOUS_IO path, as an alternative ?
With your approach, what prevent below scenario?
A B
do_swap_page
SWP_SYNCHRONOUS_IO && __swap_count == 1
shrink_page_list
add_to_swap
swap_count = 2
..
..
do_swap_page
swap_read
swap_slot_free_notify
zram's slot will be removed
page = alloc_page_vma
swap_readpage <-- read zero
>
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 063c0c1..a5ca05f 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -463,6 +463,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **);
> extern sector_t swapdev_block(int, pgoff_t);
> extern int page_swapcount(struct page *);
> extern int __swap_count(swp_entry_t entry);
> +extern bool __swap_has_cache(swp_entry_t entry);
> extern int __swp_swapcount(swp_entry_t entry);
> extern int swp_swapcount(swp_entry_t entry);
> extern struct swap_info_struct *page_swap_info(struct page *);
> @@ -589,6 +590,11 @@ static inline int __swap_count(swp_entry_t entry)
> return 0;
> }
>
> +static bool __swap_has_cache(swp_entry_t entry)
> +{
> + return 0;
> +}
> +
> static inline int __swp_swapcount(swp_entry_t entry)
> {
> return 0;
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e0c232f..a13511f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2778,7 +2778,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> struct swap_info_struct *si = swp_swap_info(entry);
>
> if (si->flags & SWP_SYNCHRONOUS_IO &&
> - __swap_count(entry) == 1) {
> + __swap_count(entry) == 1 &&
> + !__swap_has_cache(entry)) {
> /* skip swapcache */
> page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
> vmf->address);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 80445f4..2a1554a8 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1459,6 +1459,20 @@ int __swap_count(swp_entry_t entry)
> return count;
> }
>
> +bool __swap_has_cache(swp_entry_t entry)
> +{
> + struct swap_info_struct *si;
> + pgoff_t offset = swp_offset(entry);
> + bool has_cache = false;
> +
> + si = get_swap_device(entry);
> + if (si) {
> + has_cache = !!(si->swap_map[offset] & SWAP_HAS_CACHE);
> + put_swap_device(si);
> + }
> + return has_cache;
> +}
> +
> static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
> {
> int count = 0;
>
>
> >
> >> Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
> >> ---
> >> mm/memory.c | 21 ++++++++++++++++-----
> >> 1 file changed, 16 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index e0c232f..22643aa 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -2744,6 +2744,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >> struct page *page = NULL, *swapcache;
> >> struct mem_cgroup *memcg;
> >> swp_entry_t entry;
> >> + struct swap_info_struct *si;
> >> + bool skip_swapcache = false;
> >> pte_t pte;
> >> int locked;
> >> int exclusive = 0;
> >> @@ -2771,15 +2773,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>
> >>
> >> delayacct_set_flag(DELAYACCT_PF_SWAPIN);
> >> +
> >> + /*
> >> + * lookup_swap_cache below can fail and before the SWP_SYNCHRONOUS_IO
> >> + * check is made, another process can populate the swapcache, delete
> >> + * the swap entry and decrement the swap count. So decide on taking
> >> + * the SWP_SYNCHRONOUS_IO path before the lookup. In the event of the
> >> + * race described, the victim process will find a swap_count > 1
> >> + * and can then take the readahead path instead of SWP_SYNCHRONOUS_IO.
> >> + */
> >> + si = swp_swap_info(entry);
> >> + if (si->flags & SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1)
> >> + skip_swapcache = true;
> >> +
> >> page = lookup_swap_cache(entry, vma, vmf->address);
> >> swapcache = page;
> >>
> >> if (!page) {
> >> - struct swap_info_struct *si = swp_swap_info(entry);
> >> -
> >> - if (si->flags & SWP_SYNCHRONOUS_IO &&
> >> - __swap_count(entry) == 1) {
> >> - /* skip swapcache */
> >> + if (skip_swapcache) {
> >> page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
> >> vmf->address);
> >> if (page) {
> >> --
> >> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> >> member of the Code Aurora Forum, hosted by The Linux Foundation
> >>
next prev parent reply other threads:[~2019-09-10 17:51 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-08-30 12:43 [PATCH] mm: fix the race between swapin_readahead and SWP_SYNCHRONOUS_IO path Vinayak Menon
2019-09-02 13:21 ` Michal Hocko
2019-09-03 6:13 ` Vinayak Menon
2019-09-03 11:41 ` Michal Hocko
2019-09-03 12:17 ` Vinayak Menon
2019-09-09 4:05 ` Vinayak Menon
2019-09-09 11:23 ` Michal Hocko
2019-09-09 23:26 ` Minchan Kim
2019-09-10 8:22 ` Vinayak Menon
2019-09-10 17:51 ` Minchan Kim [this message]
2019-09-11 10:07 ` Vinayak Menon
2019-09-12 17:14 ` Minchan Kim
2019-09-13 9:05 ` Vinayak Menon
2019-09-16 20:05 ` Minchan Kim
2019-09-17 5:38 ` Vinayak Menon
2019-09-18 1:12 ` Minchan Kim
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190910175116.GB39783@google.com \
--to=minchan@kernel.org \
--cc=linux-mm@kvack.org \
--cc=vinmenon@codeaurora.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).