All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrea Righi <andrea.righi@canonical.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Minchan Kim <minchan@kernel.org>,
	Anchal Agarwal <anchalag@amazon.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] mm: swap: use fixed-size readahead during swapoff
Date: Thu, 16 Apr 2020 19:21:58 +0200	[thread overview]
Message-ID: <20200416172158.GA3352@xps-13> (raw)
In-Reply-To: <871rooz2gs.fsf@yhuang-dev.intel.com>

On Thu, Apr 16, 2020 at 08:41:39AM +0800, Huang, Ying wrote:
> Andrea Righi <andrea.righi@canonical.com> writes:
> 
> > On Wed, Apr 15, 2020 at 09:32:47AM +0200, Andrea Righi wrote:
> >> On Wed, Apr 15, 2020 at 10:37:00AM +0800, Huang, Ying wrote:
> >> > Andrea Righi <andrea.righi@canonical.com> writes:
> >> > 
> >> > > On Tue, Apr 14, 2020 at 09:31:24AM +0800, Huang, Ying wrote:
> >> > >> Andrea Righi <andrea.righi@canonical.com> writes:
> >> > >> 
> >> > >> > On Mon, Apr 13, 2020 at 09:00:34PM +0800, Huang, Ying wrote:
> >> > >> >> Andrea Righi <andrea.righi@canonical.com> writes:
> >> > >> >> 
> >> > >> >> [snip]
> >> > >> >> 
> >> > >> >> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> >> > >> >> > index ebed37bbf7a3..c71abc8df304 100644
> >> > >> >> > --- a/mm/swap_state.c
> >> > >> >> > +++ b/mm/swap_state.c
> >> > >> >> > @@ -20,6 +20,7 @@
> >> > >> >> >  #include <linux/migrate.h>
> >> > >> >> >  #include <linux/vmalloc.h>
> >> > >> >> >  #include <linux/swap_slots.h>
> >> > >> >> > +#include <linux/oom.h>
> >> > >> >> >  #include <linux/huge_mm.h>
> >> > >> >> >  
> >> > >> >> >  #include <asm/pgtable.h>
> >> > >> >> > @@ -507,6 +508,14 @@ static unsigned long swapin_nr_pages(unsigned long offset)
> >> > >> >> >  	max_pages = 1 << READ_ONCE(page_cluster);
> >> > >> >> >  	if (max_pages <= 1)
> >> > >> >> >  		return 1;
> >> > >> >> > +	/*
> >> > >> >> > +	 * If current task is using too much memory or swapoff is running
> >> > >> >> > +	 * simply use the max readahead size. Since we likely want to load a
> >> > >> >> > +	 * lot of pages back into memory, using a fixed-size max readhaead can
> >> > >> >> > +	 * give better performance in this case.
> >> > >> >> > +	 */
> >> > >> >> > +	if (oom_task_origin(current))
> >> > >> >> > +		return max_pages;
> >> > >> >> >  
> >> > >> >> >  	hits = atomic_xchg(&swapin_readahead_hits, 0);
> >> > >> >> >  	pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages,
> >> > >> >> 
> >> > >> >> Thinks this again.  If my understanding were correct, the accessing
> >> > >> >> pattern during swapoff is sequential, why swap readahead doesn't work?
> >> > >> >> If so, can you root cause that firstly?
> >> > >> >
> >> > >> > Theoretically if the pattern is sequential the current heuristic should
> >> > >> > already select a big readahead size, but apparently it's not doing that.
> >> > >> >
> >> > >> > I'll repeat my tests tracing the readahead size during swapoff to see
> >> > >> > exactly what's going on here.
> >> > >> 
> >> > >> I haven't verify it.  It may be helpful to call lookup_swap_cache()
> >> > >> before swapin_readahead() in unuse_pte_range().  The theory behind it is
> >> > >> to update the swap readahead statistics via lookup_swap_cache().
> >> > >
> >> > > I did more tests trying to collect some useful information.
> >> > >
> >> > > In particular I've been focusing at tracing the distribution of the
> >> > > values returned by swapin_nr_pages() in different scenarios.
> >> > >
> >> > > To do so I made swapin_nr_pages() trace-able and I used the following
> >> > > bcc command to measure the distrubution of the returned values:
> >> > >
> >> > >  # argdist-bpfcc -c -C 'r::swapin_nr_pages(unsigned long offset):unsigned long:$retval'
> >> > >
> >> > > I've collected this metric in the following scenarios:
> >> > >   - 5.6 vanilla
> >> > >   - 5.6 + lookup_swap_cache() before swapin_readahead() in
> >> > >     unuse_pte_range()
> >> > >   - 5.6 + atomic_inc(&swapin_readahead_hits) before swapin_readahead()
> >> > >     in unuse_pte_range()
> >> > >   - 5.6 + swapin_readahead_hits=last_readahead_pages (in the atomic way)
> >> > >     before swapin_readahead() in unuse_pte_range()
> >> > >
> >> > > Each kernel has been tested both with swappiness=0 and swappiness=60.
> >> > > Results are pretty much identical changing the swappiness, so I'm just
> >> > > reporting the default case here (w/ swappiness=60).
> >> > >
> >> > > Result
> >> > > ======
> >> > >
> >> > >  = swapoff performance (elapsed time) =
> >> > >
> >> > >  vanilla                 22.09s
> >> > >  lookup_swap_cache()     23.87s
> >> > >  hits++                  16.10s
> >> > >  hits=last_ra_pages       8.81s
> >> > >
> >> > >  = swapin_nr_pages() $retval distribution =
> >> > >
> >> > > 5.6 vanilla:
> >> > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> >> > > 	COUNT      EVENT
> >> > > 	36948      $retval = 8
> >> > > 	44151      $retval = 4
> >> > > 	49290      $retval = 1
> >> > > 	527771     $retval = 2
> >> > >
> >> > > 5.6 lookup_swap_cache() before swapin_readahead():
> >> > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> >> > > 	COUNT      EVENT
> >> > > 	13093      $retval = 1
> >> > > 	56703      $retval = 8
> >> > > 	123067     $retval = 2
> >> > > 	366118     $retval = 4
> >> > >
> >> > > 5.6 atomic_inc(&swapin_readahead_hits) before swapin_readahead():
> >> > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> >> > > 	COUNT      EVENT
> >> > > 	2589       $retval = 1
> >> > > 	8016       $retval = 2
> >> > > 	40021      $retval = 8
> >> > > 	566038     $retval = 4
> >> > >
> >> > > 5.6 swapin_readahead_hits=last_readahead_pages before swapin_readahead():
> >> > > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> >> > > 	COUNT      EVENT
> >> > > 	785        $retval = 2
> >> > > 	1072       $retval = 1
> >> > > 	21844      $retval = 4
> >> > > 	644168     $retval = 8
> >> > >
> >> > > In the vanilla case, the readahead heuristic seems to choose 2 pages
> >> > > most of the time. This is because we are not properly considering the
> >> > > hits (hits are always 0 in the swapoff code path) and, as you correctly
> >> > > pointed out, we can fix this by calling lookup_swap_cache() in
> >> > > unuse_pte_range() before calling swapin_readahead().
> >> > >
> >> > > With this change the distribution of the readahead size moves more
> >> > > toward 4 pages, but we still have some 2s. That looks good, however it
> >> > > doesn't seem to speed up swapoff very much... maybe because calling
> >> > > lookup_swap_cache() introduces a small overhead? (still need to
> >> > > investigate about this theory).
> >> > >
> >> > > In the next test I've tried to always increment hits by 1 before calling
> >> > > swapin_readahead() in unuse_pte_range(). This is basically cheating,
> >> > > because I'm faking the hit ratio, forcing the heuristic to use a larger
> >> > > readahead size; in fact, the readahead size moves even more toward 4
> >> > > pages and swapoff performance are a little better now.
> >> > >
> >> > > Pushing even more the "cheating" I can pretend that the previous
> >> > > readahead was all hits (swapin_readahead_hits=last_readahead_pages), so
> >> > > I'm forcing the heuristic to move toward the max size and keep using it.
> >> > > The result here is pretty much identical to my fixed-size patch, because
> >> > > swapin_nr_pages() returns the max readahead size pretty much all the
> >> > > time during swapoff (8 pages or, more in general, vm.page-cluster).
> >> > >
> >> > > Personally I don't like very much forcing the heuristic in this way,
> >> > > it'd be nice if it would just work by accounting the proper hit ratio
> >> > > (so just by adding lookup_swap_cache() as you correctly suggested), but
> >> > > this solution doesn't seem to improve performance in reality. For this
> >> > > reason I still think we should consider the swapoff scenario like a
> >> > > special one and somehow bypass the readahead heuristic and always return
> >> > > the max readahead size.
> >> > >
> >> > > Looking at the hits of the previous step in the swapoff case just
> >> > > doesn't work, because we may have some misses, but they will become hits
> >> > > very soon, since we are reading all the swapped out pages back into
> >> > > memory. This is why using the max readahead size gives better
> >> > > swapoff performance.
> >> > >
> >> > > What do you think?
> >> > 
> >> > >From your description, it appears that you are using cluster readahead
> >> > instead of vma readahead.  Can you verify this via,
> >> > 
> >> > # cat /sys/kernel/mm/swap/vma_ra_enabled
> >> 
> >> # cat /sys/kernel/mm/swap/vma_ra_enabled
> >> true
> >> 
> >> However, it's still using the cluster readahead because I'm using a
> >> swap file and nr_rotate_swap=1, so, according to the following, it's
> >> never using the vma readahead:
> >> 
> >>  static inline bool swap_use_vma_readahead(void)
> >>  {
> >>          return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
> >>  }
> >> 
> >> I'll investigate more on this, I think there's no reason to prevent the
> >> usage of vma readahead if the underlying device is non-rotational.
> >
> > Few more details about this.
> >
> > Even if I have vma_ra_enabled (in /sys/kernel/mm/swap/vma_ra_enabled) it
> > looks like my virtio_blk device is considered rotational by default:
> >
> > $ cat /sys/block/vda/queue/rotational
> > 1
> >
> > Therefore the vma readahead is not used. If I change this to
> > "rotational" then the vma readahead is used; this is confirmed by the
> > fact that swapin_nr_pages isn't called anymore:
> >
> > r::swapin_nr_pages(unsigned long offset):unsigned long:$retval
> > 	COUNT      EVENT
> > 	13         $retval = 1
> > 	18         $retval = 2
> > 	23         $retval = 4
> > 	29         $retval = 8
> >
> > swapoff time: 16.44s
> >
> > In terms of swapoff performance vma readahead works better than the
> > vanilla cluster readahead (~21s), but the "improved" cluster readahead
> > (with lookup_swap_cache() in unuse_pte_range()) still gives slightly
> > better results (~12s).
> 
> Thanks for testing.  Can you check the return value of
> __swapin_nr_pages() to verify whether vma readahead works as expected?

This is the vanilla kernel w/ vma readahead enabled:

r::__swapin_nr_pages():unsigned int:$retval
        COUNT      EVENT
        1286       $retval = 1
        6516       $retval = 4
        81252      $retval = 8
        260815     $retval = 2

swapoff time: real 18.20

And this is the kernel with the lookup_swap_cache() patch and vma
readahead enabled:

r::__swapin_nr_pages():unsigned int:$retval
        COUNT      EVENT
        321        $retval = 2
        509        $retval = 1
        3017       $retval = 4
        124090     $retval = 8

swapoff time: 15.33s

So, vma readahead seems to work as expected and swapoff now is faster
also with vma readahead enabled.

-Andrea

  reply	other threads:[~2020-04-16 17:22 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-13 11:18 [PATCH v2] mm: swap: use fixed-size readahead during swapoff Andrea Righi
2020-04-13 13:00 ` Huang, Ying
2020-04-13 13:00   ` Huang, Ying
2020-04-13 13:31   ` Andrea Righi
2020-04-14  1:31     ` Huang, Ying
2020-04-14  1:31       ` Huang, Ying
2020-04-14 13:05       ` Andrea Righi
2020-04-15  2:37         ` Huang, Ying
2020-04-15  2:37           ` Huang, Ying
2020-04-15  7:32           ` Andrea Righi
2020-04-15  7:44             ` Huang, Ying
2020-04-15  7:44               ` Huang, Ying
2020-04-15  9:19               ` Andrea Righi
2020-04-16  0:44                 ` Huang, Ying
2020-04-16  0:44                   ` Huang, Ying
2020-04-15 12:00             ` Andrea Righi
2020-04-16  0:41               ` Huang, Ying
2020-04-16  0:41                 ` Huang, Ying
2020-04-16 17:21                 ` Andrea Righi [this message]
2020-04-13 13:13 ` Huang, Ying
2020-04-13 13:13   ` Huang, Ying
2020-04-13 13:26   ` Andrea Righi
2020-04-14  1:45     ` Huang, Ying
2020-04-14  1:45       ` Huang, Ying

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200416172158.GA3352@xps-13 \
    --to=andrea.righi@canonical.com \
    --cc=akpm@linux-foundation.org \
    --cc=anchalag@amazon.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=minchan@kernel.org \
    --cc=ying.huang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.