Re: [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme

From: Jerome Glisse <jglisse@redhat.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: Michal Hocko <mhocko@kernel.org>,
	lsf-pc@lists.linux-foundation.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, Benjamin LaHaise <bcrl@kvack.org>
Subject: Re: [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme
Date: Wed, 22 Jan 2020 09:40:59 -0800
Message-ID: <20200122174059.GA7033@redhat.com>
In-Reply-To: <00864312-13cc-daac-36e8-5f3f5b6dbeb8@kernel.dk>

On Wed, Jan 22, 2020 at 10:38:56AM -0700, Jens Axboe wrote:
> On 1/22/20 10:28 AM, Jerome Glisse wrote:
> > On Wed, Jan 22, 2020 at 10:04:44AM -0700, Jens Axboe wrote:
> >> On 1/22/20 9:54 AM, Jerome Glisse wrote:
> >>> On Wed, Jan 22, 2020 at 08:12:51AM -0700, Jens Axboe wrote:
> >>>> On 1/22/20 4:59 AM, Michal Hocko wrote:
> >>>>> On Tue 21-01-20 20:57:23, Jerome Glisse wrote:
> >>>>>> We can also discuss what kind of knobs we want to expose so that
> >>>>>> people can choose the tradeoff themselves (i.e. from "I want low
> >>>>>> latency io_uring and I don't care whether mm cannot do its business" to
> >>>>>> "I want mm to never be impeded in its business and I accept the extra
> >>>>>> latency bursts I might face in IO operations").
> >>>>>
> >>>>> I do not think it is a good idea to make this configurable. How can
> >>>>> people sensibly choose between the two without deep understanding of
> >>>>> internals?
> >>>>
> >>>> Fully agree, we can't just punt this to a knob and call it good, that's
> >>>> a typical fallacy of core changes. And there is only one mode for
> >>>> io_uring, and that's consistent low latency. If this change introduces
> >>>> weird reclaim, compaction or migration latencies, then that's a
> >>>> non-starter as far as I'm concerned.
> >>>>
> >>>> And what do those two settings even mean? I don't even know, and a user
> >>>> sure as hell doesn't either.
> >>>>
> >>>> io_uring pins two types of pages - registered buffers, these are used
> >>>> for actual IO, and the rings themselves. The rings are not used for IO,
> >>>> just used to communicate between the application and the kernel.
> >>>
> >>> So, do we still want to solve writeback of file-backed pages when the
> >>> pages in the user buffer come from a file?
> >>
> >> That's not currently a concern for io_uring, as it disallows file backed
> >> pages for the IO buffers that are being registered.
> >>
> >>> Also, we can introduce a flag when registering a buffer that allows
> >>> registering it without pinning, and thus avoids RLIMIT_MEMLOCK at
> >>> the cost of possible latency spikes. Then the user registering the
> >>> buffer knows what they get.
> >>
> >> That may be fine for other users, but I don't think it'll apply
> >> to io_uring. I can't see anyone selecting that flag, unless you're
> >> doing something funky where you're registering a substantial amount
> >> of the system memory for IO buffers. And I don't think that's going
> >> to be a super valid use case...
> > 
> > Given that datasets are getting bigger and bigger, I would assume that
> > we will have people who want to use io_uring with large buffers.
> > 
> >>
> >>> Maybe it would be good to test; it might stay in the noise, in which
> >>> case it might be a good thing to do. Also, there are strategies to
> >>> avoid latency spikes: for instance, we can block or skip mm
> >>> invalidation if a buffer has pending/running IO in the ring, i.e. only
> >>> let buffer invalidation happen when there is no pending/running
> >>> submission entry.
> >>
> >> Would that really work? The buffer could very well be idle right when
> >> you check, but wanting to do IO the instant you decide you can do
> >> background work on it. Additionally, that would require accounting
> >> on when the buffers are inflight, which is exactly the kind of
> >> overhead we're trying to avoid to begin with.
> >>
> >>> We can also pick what kind of invalidation we allow (compaction,
> >>> migration, ...) and thus limit the scope and likelihood of
> >>> invalidation.
> >>
> >> I think it'd be useful to try and understand the use case first.
> >> If we're pinning a small percentage of the system memory, do we
> >> really care at all? Isn't it completely fine to just ignore?
> > 
> > My main motivation is migration on NUMA systems: if the process that
> > registered the buffer gets migrated to a different node, it might
> > actually end up with bad performance because its IO buffers are still
> > on the old node. I am not sure we want to tell application developers to
> > constantly monitor which node they are on and to re-register buffers
> > after process migration to allow for memory migration.
> 
> If the process truly cares, would it not have pinned itself to that
> node?

Not necessarily. Programmers cannot think of everything, and process
pinning defeats load balancing. Moreover, we now have to think about deep
memory topologies: by the time you register the buffer, the pages backing
it might be in slower memory, and then all your IO and CPU accesses will
be stuck using that memory.
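
To make the monitoring burden concrete, here is a minimal sketch of the
bookkeeping an application would need if it had to re-register its
buffers by hand after being moved to another node. Everything in it is
hypothetical and for illustration only: the RegisteredBuffers class and
the current_node/register/unregister callbacks are made-up names, not an
existing API. In a real program the callbacks would presumably wrap
something like getcpu(2) and the io_uring buffer register/unregister
operations.

```python
from typing import Callable


class RegisteredBuffers:
    """Remember which NUMA node the IO buffers were registered on.

    All three callbacks are hypothetical stand-ins supplied by the caller:
      current_node() -> node the process is running on right now
      register(node) -> allocate and register the buffers on that node
      unregister()   -> drop the current registration (and its pins)
    """

    def __init__(self,
                 current_node: Callable[[], int],
                 register: Callable[[int], None],
                 unregister: Callable[[], None]) -> None:
        self._current_node = current_node
        self._register = register
        self._unregister = unregister
        # Initial registration happens on whatever node we start on.
        self.node = current_node()
        register(self.node)

    def check_and_reregister(self) -> bool:
        """Re-register the buffers if the process changed node.

        Returns True when a re-registration happened. The application
        would have to call this periodically, which is exactly the
        constant monitoring discussed above.
        """
        node = self._current_node()
        if node == self.node:
            return False
        # The old buffers are still pinned on the old node: drop them
        # and register fresh ones on the node we are running on now.
        self._unregister()
        self._register(node)
        self.node = node
        return True
```

Note that the check is inherently racy (the scheduler can move the
process again right after it), and every caller would have to carry this
logic, which is why I would rather not push it onto applications.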

Cheers,
Jérôme