From: Mel Gorman <mel@csn.ul.ie>
To: Rik van Riel <riel@redhat.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
akpm@linux-foundation.org,
KOSAKI Motohiro <kosaki.motohiro@gmail.com>,
Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [PATCH -mm] make swapin readahead skip over holes
Date: Thu, 12 Jan 2012 14:33:01 +0000 [thread overview]
Message-ID: <20120112143301.GG3910@csn.ul.ie> (raw)
In-Reply-To: <4F0DE1CE.5000006@redhat.com>
On Wed, Jan 11, 2012 at 02:23:58PM -0500, Rik van Riel wrote:
> On 01/11/2012 11:51 AM, Mel Gorman wrote:
> >On Mon, Jan 09, 2012 at 06:10:23PM -0500, Rik van Riel wrote:
>
> >>+ get_swap_cluster(entry,&offset,&end_offset);
> >>+
> >>+ for (; offset<= end_offset ; offset++) {
> >> /* Ok, do the async read-ahead now */
> >> page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
> >> gfp_mask, vma, addr);
> >> if (!page)
> >>- break;
> >>+ continue;
> >> page_cache_release(page);
> >> }
> >
> >For a heavily fragmented swap file, this will result in more IO and
> >the gamble is that pages nearby are needed soon. You say your virtual
> >machines swapin faster and that does not surprise me. I also expect
> >they need the data so it's a net win.
>
> More IO operations, yes. However, IO operations from nearby
> blocks can often be done without incurring extra disk seeks,
> because the disk head is already in the right place.
>
I understand that, the result still has to go somewhere in memory
though. The cost in terms of IO speed might be effectively 0 but it
still consumes memory resources.
> This seems to be born out by the fact that I saw swapin
> rates increase from maybe 200-300kB/s to 5-15MB/s...
>
> Even on some SSDs it could avoid some bank switches, though
> of course there I would expect the effect to be much less
> pronounced.
>
> >There is an possibility that under memory pressure that swapping in
> >more pages will cause more memory pressure (increased swapin causing
> >clean page cache discards and pageout) and be an overall loss. This may
> >be a net loss in some cases such as where the working set size is just
> >over physical memory and the increased swapin causes a problem. I doubt
> >this case is common but it is worth bearing in mind if future bug
> >reports complain about increased swap activity.
>
> True, there may be workloads that benefit from a smaller
> page-cluster. The fact that the recently swapped in pages
> are all put on the inactive anon list should help protect
> the working set, too.
>
True to some extent.
> Another alternative may be a time based decision. If we
> have swapped something out recently, go with a less
> aggressive swapin readahead.
>
Maybe, but while I know I brought up the problem of swapin might
increase swapout, it would be best to have an example of how that can
happen before looking at heuristics.
> That would automatically give us fast swapin readahead
> when in "memory hog just exited, let the system recover"
> mode, and conservative swapin readahead in your situation.
>
> However, that could still hurt badly if the system is just
> moving the working set from one part of a program to another.
>
> I suspect we will be faster off by having faster swap IO,
> which this patch seems to provide.
>
More than likely.
> >>- si = swap_info[swp_type(entry)];
> >>- target = swp_offset(entry);
> >>- base = (target>> our_page_cluster)<< our_page_cluster;
> >>- end = base + (1<< our_page_cluster);
> >>- if (!base) /* first page is swap header */
> >>- base++;
>
> >>+ si = swap_info[swp_type(entry)];
> >>+ /* Round the begin down to a page_cluster boundary. */
> >>+ offset = (offset>> page_cluster)<< page_cluster;
> >
> >Minor nit but it would feel more natural to me to see
> >
> >offset& ~((1<< page_cluster) - 1)
> >
> >but I understand that you are reusing the existing code.
>
> Sure, I can do that.
>
> While I'm there, I can also add that if (!base) base++
> thing back in :)
>
Whoops, I missed that!
> >>+ *begin = offset;
> >>+ /* Round the end up, but not beyond the end of the swap device. */
> >>+ offset = offset + (1<< page_cluster);
> >>+ if (offset> si->max)
> >>+ offset = si->max;
> >>+ *end = offset;
> >> spin_unlock(&swap_lock);
> >>-
> >>- /*
> >>- * Indicate starting offset, and return number of pages to get:
> >>- * if only 1, say 0, since there's then no readahead to be done.
> >>- */
> >>- *offset = ++toff;
> >>- return nr_pages? ++nr_pages: 0;
> >> }
> >
> >This section deletes code which is nice but there is a
> >problem. Your changelog says that this is duplicating the effort of
> >read_swap_cache_async() which is true but what it does is
> >
> >1. a swap cache lookup which will probably fail
> >2. alloc_page_vma()
> >3. radix_tree_preload()
> >4. swapcache_prepare
> > - calls __swap_duplicate()
> > - finds the hole, bails out
> >
> >That's a lot of work before the hole is found. Would it be worth
> >doing a racy check in swapin_readahead without swap lock held before
> >calling read_swap_cache_async()?
>
> The problem is that without the swap_lock held, the swap_info
> struct may disappear completely because of the swapin_readahead
> happening concurrently with a swapoff.
>
hmm, I considered that when I wrote the suggestion and was making
an assumption that the presense of swap pages on the area would
effectively pin the swap_info struct.
I did think of the race that swapoff is operating on the very last page
that swapin is reading and removes the map at the same time. We make a
racy check against effectively random memory and one of three results
might happen
1. DEBUG_PAGE_ALLOC was enabled and we blow up, ok, this one is bad if
somewhat corner case
2. False positive count - we fall through and try to read the page,
find the swap page no longer exists and bail gracefully
3. False zero count - we do not readin the swap page. It's not a big
deal if swapin_readahead fails to read in a page it should have
(swapoff has brought it in anyway).
Basically, I did not see any major downside with doing a racy check. If
1 above is a serious problem, then yes, we have to take the swap_lock
but that's still cheaper than allocating needlessly.
> I suspect that the CPU time spent doing 1-4 above will be
> negligible compared to the amount of time spent doing disk IO,
Negligible in comparison to the IO sure, but it still seems like we
could be doing a lot of allocation work (alloc_page_vma and
radix_tree_preload) for no good reason.
> but if there turns out to be a problem it should be possible
> to move the swap hole identification closer to the top of
> swap_cache_read_async().
>
There is that, it'll be tricky to spot that we're needless allocating
pages though. Not the end of the world, what you have still improves
things.
--
Mel Gorman
SUSE Labs
prev parent reply other threads:[~2012-01-12 14:33 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-01-09 23:10 [PATCH -mm] make swapin readahead skip over holes Rik van Riel
2012-01-09 23:49 ` KOSAKI Motohiro
2012-01-10 3:09 ` Rik van Riel
2012-01-11 7:14 ` KOSAKI Motohiro
2012-01-11 8:01 ` KOSAKI Motohiro
2012-01-11 8:05 ` KOSAKI Motohiro
2012-01-11 14:11 ` John Stoffel
2012-01-11 19:17 ` Christoph Lameter
2012-01-11 21:03 ` John Stoffel
2012-01-11 22:25 ` KOSAKI Motohiro
2012-01-11 23:15 ` Andrew Morton
2012-01-11 16:51 ` Mel Gorman
2012-01-11 19:23 ` Rik van Riel
2012-01-12 14:33 ` Mel Gorman [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20120112143301.GG3910@csn.ul.ie \
--to=mel@csn.ul.ie \
--cc=akpm@linux-foundation.org \
--cc=hannes@cmpxchg.org \
--cc=kosaki.motohiro@gmail.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=riel@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).