linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mm: readahead: do not cap readahead() and MADV_WILLNEED
@ 2016-02-24  1:38 Johannes Weiner
  2016-02-24  2:34 ` Linus Torvalds
  0 siblings, 1 reply; 4+ messages in thread
From: Johannes Weiner @ 2016-02-24  1:38 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds; +Cc: linux-mm, linux-kernel, kernel-team

All readahead is currently capped to a maximum of the device readahead
limit, which defaults to 128k. For heuristics-based readahead this
makes perfect sense, too, but unfortunately the limit is also applied
to the explicit readahead() or madvise(MADV_WILLNEED) syscalls, and
128k is an awfully low limit, particularly for bigger machines. It's
not unreasonable for a user on a 100G machine to say, read this 1G
file, and read it now, I'm going to access the whole thing shortly.

Since both readahead() and MADV_WILLNEED take an explicit length
parameter, it seems weird to truncate that request quietly. Just do
what the user asked for and leave the limiting to the heuristics.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/readahead.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 20e58e8..6d182db 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -212,7 +212,6 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 	if (unlikely(!mapping->a_ops->readpage && !mapping->a_ops->readpages))
 		return -EINVAL;
 
-	nr_to_read = min(nr_to_read, inode_to_bdi(mapping->host)->ra_pages);
 	while (nr_to_read) {
 		int err;
 
@@ -485,6 +484,7 @@ void page_cache_sync_readahead(struct address_space *mapping,
 
 	/* be dumb */
 	if (filp && (filp->f_mode & FMODE_RANDOM)) {
+		req_size = min(req_size, inode_to_bdi(mapping->host)->ra_pages);
 		force_page_cache_readahead(mapping, filp, offset, req_size);
 		return;
 	}
-- 
2.7.1

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] mm: readahead: do not cap readahead() and MADV_WILLNEED
  2016-02-24  1:38 [PATCH] mm: readahead: do not cap readahead() and MADV_WILLNEED Johannes Weiner
@ 2016-02-24  2:34 ` Linus Torvalds
  2016-02-29 19:41   ` Johannes Weiner
  0 siblings, 1 reply; 4+ messages in thread
From: Linus Torvalds @ 2016-02-24  2:34 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, linux-mm, Linux Kernel Mailing List, kernel-team

On Tue, Feb 23, 2016 at 5:38 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Since both readahead() and MADV_WILLNEED take an explicit length
> parameter, it seems weird to truncate that request quietly. Just do
> what the user asked for and leave the limiting to the heuristics.

Why the hell do people continue to try to push these kinds of changes?

Just read the code that you changed: the size_t is a trivially
user-supplied number that any user can set to any value. And you just
removed all sanity checking for it, so now people can basically do
unlimited IO that is not even killable.

Seriously.

[ It's slightly saved by the fact that you have to find a file that is
large and readable, which will happily limit things a lot on practice,
but it's still something we should worry about.

  It is also at least partially mitigated by the fact that read-ahead
allocations use __GFP_NORETRY and hopefully thus don't do the worst
kind of damage to the memory management, but I'd hate to have to rely
on that ]

At least with regular read() system calls, we try to use killable IO
(well - some filesystems might not do it, but the core pagecache
functions support it), and we limit the maximum IO size even if that
limit happens to be pretty high.

You just completely removed that limiting for the readahead code. So
now you can pass in any arbitrary 64-bit size.

Why do you think that "Just do what the user asked for" is obviously
the right thing?

What if I as a user asked to overwrite /etc/passwd with my own data?
Would that be right?

And if that isn't right, then why do you assume it is right that you
can do infinite readahead?

Now, it is entirely possible that we should raise the limit (note that
"raise" is different from "remove"). But that requires

 (a) thought
 (b) data

and this patch had neither.

If we raise the limit, we need to do so intelligently. It's almost
certainly going to involve looking at how much free memory we have,
because part of "readahead" is very much also "don't disturb other
users and IO too much".

The current choice for the limit is "small enough that we don't need
to think too much about it". If we raise it to hundreds of megs, we
very definitely will want to think about it much more. We might also
want to make sure that the operation can be properly aborted,
something we haven't needed to worry about for the small limit we have
now.

                   Linus

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] mm: readahead: do not cap readahead() and MADV_WILLNEED
  2016-02-24  2:34 ` Linus Torvalds
@ 2016-02-29 19:41   ` Johannes Weiner
  2016-02-29 21:03     ` Kirill A. Shutemov
  0 siblings, 1 reply; 4+ messages in thread
From: Johannes Weiner @ 2016-02-29 19:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, linux-mm, Linux Kernel Mailing List, kernel-team

On Tue, Feb 23, 2016 at 06:34:59PM -0800, Linus Torvalds wrote:
> Why do you think that "Just do what the user asked for" is obviously
> the right thing?

In our situation, we are trying to prime the cache for what we know
will be the definite workingset of the application. We don't care if
it maxes out the IO capacity, and we don't care if it throws out any
existing cache to accomplish the task. In fact, if you're sure about
the workingset, that is desired behavior. It's basically read(), but
without the pointless copying and waiting for completion.

One of the mistakes I made was to look only at the manpage, and not at
how readahead() is or has historically been used in the field.

One such usecase is warming the system during bootup, where system
software fires off readahead against all manner of libraries and
executables that are likely to be used. In that scenario the caller
really doesn't know for sure it's reading the right thing. And if not,
the optimistic readahead shouldn't vaccuum up all the resources and
interfere with the IO and memory demands of the *actual* workingset.

It seems that the optimistic readahead during bootup is being phased
out nowadays. Systemd took over with systemd-readahead, then dropped
it eventually citing lack of desired performance benefits and
relevance; there is another project called preload but it appears
defunct as well. For all we know, though, there still are people who
fire off optimistic readahead, and we can't regress them. Certainly
older or customized userspace still running bootup readahead, or maybe
comparable applications where workingsets are estimated by heuristics.

It's unfortunate, because I frankly doubt we ever got the "else" part,
the not-interfering-with-the-real-workload part, working anyway. The
fact that distros are moving away from it or that we ended up limiting
the window to near-ineffective levels seem to be a symptoms of that.
That means the facility is now stuck somewhere in between questionable
for optimistic readahead and not useful for reliable cache priming.

We can't really make it work for both cases as their requirements are
in direct conflict with each other. Lowering the limit from cache+free
to 128k was a regression for priming a big known workingset, but there
is also no point in going back now and risk regressing the other side.

So it appears best to add a new syscall with clearly defined semantics
to forcefully prime the cache.

That, or switch to read() from a separate thread for cache priming.

Hmm?

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] mm: readahead: do not cap readahead() and MADV_WILLNEED
  2016-02-29 19:41   ` Johannes Weiner
@ 2016-02-29 21:03     ` Kirill A. Shutemov
  0 siblings, 0 replies; 4+ messages in thread
From: Kirill A. Shutemov @ 2016-02-29 21:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linus Torvalds, Andrew Morton, linux-mm,
	Linux Kernel Mailing List, kernel-team

On Mon, Feb 29, 2016 at 02:41:59PM -0500, Johannes Weiner wrote:
> That, or switch to read() from a separate thread for cache priming.

mmap(MAP_POPULATE) would save you some copy_to_user() overhead, comparing
to read().

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2016-02-29 21:03 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-24  1:38 [PATCH] mm: readahead: do not cap readahead() and MADV_WILLNEED Johannes Weiner
2016-02-24  2:34 ` Linus Torvalds
2016-02-29 19:41   ` Johannes Weiner
2016-02-29 21:03     ` Kirill A. Shutemov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).