linux-fsdevel.vger.kernel.org archive mirror
* [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme
@ 2020-01-22  2:31 jglisse
  2020-01-22  3:54 ` Jens Axboe
  2020-01-22  4:19 ` Dan Williams
  0 siblings, 2 replies; 16+ messages in thread
From: jglisse @ 2020-01-22  2:31 UTC (permalink / raw)
  To: lsf-pc
  Cc: Jérôme Glisse, linux-fsdevel, linux-mm, Jens Axboe,
	Benjamin LaHaise

From: Jérôme Glisse <jglisse@redhat.com>

Direct I/O pins memory through GUP (get user pages), which blocks
several mm activities such as:
    - compaction
    - numa
    - migration
    ...

It is also troublesome if the pinned pages are actually file-backed
pages that might go under writeback, in which case the pages cannot
be write protected from the direct-io point of view (see the various
discussions about recent work on GUP [1]). This happens, for
instance, if the virtual memory address used as the buffer for a
read operation is the outcome of an mmap of a regular file.
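
To make the pinning concrete, here is a rough sketch of the pattern
(illustrative only, not the actual fs/direct-io.c code) by which a
direct-io read ends up holding references on the pages backing the
user buffer for the whole duration of the io:

#include <linux/mm.h>

/* Illustrative only: pin the user buffer for a direct-io read. */
static int dio_pin_user_buffer(unsigned long uaddr, size_t len,
			       struct page **pages)
{
	int nr_pages = DIV_ROUND_UP(len + (uaddr & ~PAGE_MASK), PAGE_SIZE);
	int pinned;

	/* FOLL_WRITE because a read(2) writes into the user buffer. */
	pinned = get_user_pages_fast(uaddr & PAGE_MASK, nr_pages,
				     FOLL_WRITE, pages);
	if (pinned < 0)
		return pinned;
	if (pinned < nr_pages) {
		while (pinned--)
			put_page(pages[pinned]);
		return -EFAULT;
	}

	/*
	 * From here until the references are dropped at io completion,
	 * these pages cannot be migrated or compacted, and file-backed
	 * pages cannot safely be write protected for writeback.
	 */
	return 0;
}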


With direct-io or aio (asynchronous io), pages are pinned until
syscall completion (which depends on many factors: io size, block
device speed, ...). With io-uring, pages can be pinned for an
indefinite amount of time.


So I would like to convert the direct io code (direct-io, aio and
io-uring) to obey mmu notifiers and thus allow memory management
and writeback to work and behave as with any other process memory.

For direct-io and aio this mostly gives a way to wait on syscall
completion. For io-uring this means that a buffer might need to be
re-validated (ie looking up the pages again to get the new set of
pages for the buffer). The impact for io-uring is the delay needed
to look up new pages or to wait on writeback (if necessary). This
would only happen _if_ an invalidation event happens, which itself
should only happen under memory pressure or for NUMA activities.

There are ways to minimize the impact (for instance by using the
mmu notifier event type to ignore some invalidation cases).
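
As a rough illustration of the direction I have in mind for the
io-uring registered buffer case (every name below is made up, this
is not a patch), the buffer object would carry an interval notifier
whose invalidate callback marks the buffer stale, so that the next
submission looks the pages up again:

#include <linux/mmu_notifier.h>

/* Sketch only; struct and function names are invented. */
struct uring_ubuf {
	struct mmu_interval_notifier mni;
	struct page **pages;
	unsigned long npages;
	bool stale;	/* set by the notifier, cleared on revalidation */
};

static bool uring_ubuf_invalidate(struct mmu_interval_notifier *mni,
				  const struct mmu_notifier_range *range,
				  unsigned long cur_seq)
{
	struct uring_ubuf *ubuf = container_of(mni, struct uring_ubuf, mni);

	mmu_interval_set_seq(mni, cur_seq);
	WRITE_ONCE(ubuf->stale, true);
	/*
	 * Here the page references would also be dropped so that
	 * compaction, migration or writeback can make progress.
	 */
	return true;
}

static const struct mmu_interval_notifier_ops uring_ubuf_ops = {
	.invalidate = uring_ubuf_invalidate,
};

The notifier would be registered over the buffer's virtual range with
mmu_interval_notifier_insert() at buffer-registration time.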


So I would like to discuss all this during LSF; it is mostly a
filesystem discussion with strong ties to mm.


[1] GUP https://lkml.org/lkml/2019/3/8/805 and all subsequent
    discussion.

To: lsf-pc@lists.linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Benjamin LaHaise <bcrl@kvack.org>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme
  2020-01-22  2:31 [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme jglisse
@ 2020-01-22  3:54 ` Jens Axboe
  2020-01-22  4:57   ` Jerome Glisse
  2020-01-27 19:01   ` Jason Gunthorpe
  2020-01-22  4:19 ` Dan Williams
  1 sibling, 2 replies; 16+ messages in thread
From: Jens Axboe @ 2020-01-22  3:54 UTC (permalink / raw)
  To: jglisse, lsf-pc; +Cc: linux-fsdevel, linux-mm, Benjamin LaHaise

On 1/21/20 7:31 PM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> Direct I/O does pin memory through GUP (get user page) this does
> block several mm activities like:
>     - compaction
>     - numa
>     - migration
>     ...
> 
> It is also troublesome if the pinned pages are actualy file back
> pages that migth go under writeback. In which case the page can
> not be write protected from direct-io point of view (see various
> discussion about recent work on GUP [1]). This does happens for
> instance if the virtual memory address use as buffer for read
> operation is the outcome of an mmap of a regular file.
> 
> 
> With direct-io or aio (asynchronous io) pages are pinned until
> syscall completion (which depends on many factors: io size,
> block device speed, ...). For io-uring pages can be pinned an
> indifinite amount of time.
> 
> 
> So i would like to convert direct io code (direct-io, aio and
> io-uring) to obey mmu notifier and thus allow memory management
> and writeback to work and behave like any other process memory.
> 
> For direct-io and aio this mostly gives a way to wait on syscall
> completion. For io-uring this means that buffer might need to be
> re-validated (ie looking up pages again to get the new set of
> pages for the buffer). Impact for io-uring is the delay needed
> to lookup new pages or wait on writeback (if necessary). This
> would only happens _if_ an invalidation event happens, which it-
> self should only happen under memory preissure or for NUMA
> activities.
> 
> They are ways to minimize the impact (for instance by using the
> mmu notifier type to ignore some invalidation cases).
> 
> 
> So i would like to discuss all this during LSF, it is mostly a
> filesystem discussion with strong tie to mm.

I'd be interested in this topic, as it pertains to io_uring. The whole
point of registered buffers is to avoid mapping overhead and page
references. If we add extra overhead per operation for that, well... I'm
assuming the above is strictly for file-mapped pages? Or also page
migration?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme
  2020-01-22  2:31 [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme jglisse
  2020-01-22  3:54 ` Jens Axboe
@ 2020-01-22  4:19 ` Dan Williams
  2020-01-22  5:00   ` Jerome Glisse
  1 sibling, 1 reply; 16+ messages in thread
From: Dan Williams @ 2020-01-22  4:19 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: lsf-pc, linux-fsdevel, linux-mm, Jens Axboe, Benjamin LaHaise

On Tue, Jan 21, 2020 at 6:34 PM <jglisse@redhat.com> wrote:
>
> From: Jérôme Glisse <jglisse@redhat.com>
>
> Direct I/O does pin memory through GUP (get user page) this does
> block several mm activities like:
>     - compaction
>     - numa
>     - migration
>     ...
>
> It is also troublesome if the pinned pages are actualy file back
> pages that migth go under writeback. In which case the page can
> not be write protected from direct-io point of view (see various
> discussion about recent work on GUP [1]). This does happens for
> instance if the virtual memory address use as buffer for read
> operation is the outcome of an mmap of a regular file.
>
>
> With direct-io or aio (asynchronous io) pages are pinned until
> syscall completion (which depends on many factors: io size,
> block device speed, ...). For io-uring pages can be pinned an
> indifinite amount of time.
>
>
> So i would like to convert direct io code (direct-io, aio and
> io-uring) to obey mmu notifier and thus allow memory management
> and writeback to work and behave like any other process memory.
>
> For direct-io and aio this mostly gives a way to wait on syscall
> completion. For io-uring this means that buffer might need to be
> re-validated (ie looking up pages again to get the new set of
> pages for the buffer). Impact for io-uring is the delay needed
> to lookup new pages or wait on writeback (if necessary). This
> would only happens _if_ an invalidation event happens, which it-
> self should only happen under memory preissure or for NUMA
> activities.

This seems to assume that memory pressure and NUMA migration are rare
events. Some of the proposed hierarchical memory management schemes
[1] might impact that assumption.

[1]: http://lore.kernel.org/r/20191101075727.26683-1-ying.huang@intel.com/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme
  2020-01-22  3:54 ` Jens Axboe
@ 2020-01-22  4:57   ` Jerome Glisse
  2020-01-22 11:59     ` Michal Hocko
  2020-01-27 19:01   ` Jason Gunthorpe
  1 sibling, 1 reply; 16+ messages in thread
From: Jerome Glisse @ 2020-01-22  4:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: lsf-pc, linux-fsdevel, linux-mm, Benjamin LaHaise

On Tue, Jan 21, 2020 at 08:54:22PM -0700, Jens Axboe wrote:
> On 1/21/20 7:31 PM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > Direct I/O does pin memory through GUP (get user page) this does
> > block several mm activities like:
> >     - compaction
> >     - numa
> >     - migration
> >     ...
> > 
> > It is also troublesome if the pinned pages are actualy file back
> > pages that migth go under writeback. In which case the page can
> > not be write protected from direct-io point of view (see various
> > discussion about recent work on GUP [1]). This does happens for
> > instance if the virtual memory address use as buffer for read
> > operation is the outcome of an mmap of a regular file.
> > 
> > 
> > With direct-io or aio (asynchronous io) pages are pinned until
> > syscall completion (which depends on many factors: io size,
> > block device speed, ...). For io-uring pages can be pinned an
> > indifinite amount of time.
> > 
> > 
> > So i would like to convert direct io code (direct-io, aio and
> > io-uring) to obey mmu notifier and thus allow memory management
> > and writeback to work and behave like any other process memory.
> > 
> > For direct-io and aio this mostly gives a way to wait on syscall
> > completion. For io-uring this means that buffer might need to be
> > re-validated (ie looking up pages again to get the new set of
> > pages for the buffer). Impact for io-uring is the delay needed
> > to lookup new pages or wait on writeback (if necessary). This
> > would only happens _if_ an invalidation event happens, which it-
> > self should only happen under memory preissure or for NUMA
> > activities.
> > 
> > They are ways to minimize the impact (for instance by using the
> > mmu notifier type to ignore some invalidation cases).
> > 
> > 
> > So i would like to discuss all this during LSF, it is mostly a
> > filesystem discussion with strong tie to mm.
> 
> I'd be interested in this topic, as it pertains to io_uring. The whole
> point of registered buffers is to avoid mapping overhead, and page
> references. If we add extra overhead per operation for that, well... I'm
> assuming the above is strictly for file mapped pages? Or also page
> migration?

Both file-backed and anonymous pages. The idea is that we have a choice
on what to do, ie favor io-uring and make it a last resort for mm to
mess with a page that is GUPed, or we could favor mm (compaction, NUMA,
reclaim, ...). We can also discuss what kind of knobs we want to expose
so that people can decide on the tradeoff themselves (ie ranging from
"I want low latency io-uring and I don't care if mm cannot do its
business" to "I want mm to never be impeded in its business and I
accept the extra latency bursts I might face in io operations").

One of the issues with io-uring, AFAICT, is that today someone could
potentially pin pages that are never actually used for direct io and
thus DDOS the mm or starve others.

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme
  2020-01-22  4:19 ` Dan Williams
@ 2020-01-22  5:00   ` Jerome Glisse
  2020-01-22 15:56     ` [Lsf-pc] " Dan Williams
  0 siblings, 1 reply; 16+ messages in thread
From: Jerome Glisse @ 2020-01-22  5:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: lsf-pc, linux-fsdevel, linux-mm, Jens Axboe, Benjamin LaHaise

On Tue, Jan 21, 2020 at 08:19:54PM -0800, Dan Williams wrote:
> On Tue, Jan 21, 2020 at 6:34 PM <jglisse@redhat.com> wrote:
> >
> > From: Jérôme Glisse <jglisse@redhat.com>
> >
> > Direct I/O does pin memory through GUP (get user page) this does
> > block several mm activities like:
> >     - compaction
> >     - numa
> >     - migration
> >     ...
> >
> > It is also troublesome if the pinned pages are actualy file back
> > pages that migth go under writeback. In which case the page can
> > not be write protected from direct-io point of view (see various
> > discussion about recent work on GUP [1]). This does happens for
> > instance if the virtual memory address use as buffer for read
> > operation is the outcome of an mmap of a regular file.
> >
> >
> > With direct-io or aio (asynchronous io) pages are pinned until
> > syscall completion (which depends on many factors: io size,
> > block device speed, ...). For io-uring pages can be pinned an
> > indifinite amount of time.
> >
> >
> > So i would like to convert direct io code (direct-io, aio and
> > io-uring) to obey mmu notifier and thus allow memory management
> > and writeback to work and behave like any other process memory.
> >
> > For direct-io and aio this mostly gives a way to wait on syscall
> > completion. For io-uring this means that buffer might need to be
> > re-validated (ie looking up pages again to get the new set of
> > pages for the buffer). Impact for io-uring is the delay needed
> > to lookup new pages or wait on writeback (if necessary). This
> > would only happens _if_ an invalidation event happens, which it-
> > self should only happen under memory preissure or for NUMA
> > activities.
> 
> This seems to assume that memory pressure and NUMA migration are rare
> events. Some of the proposed hierarchical memory management schemes
> [1] might impact that assumption.
> 
> [1]: http://lore.kernel.org/r/20191101075727.26683-1-ying.huang@intel.com/
> 

Yes, it is true that this will likely become more and more of an issue.
We are facing a tough choice here, as pinning blocks NUMA balancing or
any other kind of migration and thus might impede performance, while
invalidating an io-uring buffer will also cause a small latency burst.
I do not think we can make everyone happy, but at the very least we
should avoid pinning and provide knobs to let users decide what they
care more about (ie io without bursts or better NUMA locality).

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme
  2020-01-22  4:57   ` Jerome Glisse
@ 2020-01-22 11:59     ` Michal Hocko
  2020-01-22 15:12       ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2020-01-22 11:59 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Jens Axboe, lsf-pc, linux-fsdevel, linux-mm, Benjamin LaHaise

On Tue 21-01-20 20:57:23, Jerome Glisse wrote:
> We can also discuss what kind of knobs we want to expose so that
> people can decide to choose the tradeof themself (ie from i want low
> latency io-uring and i don't care wether mm can not do its business; to
> i want mm to never be impeded in its business and i accept the extra
> latency burst i might face in io operations).

I do not think it is a good idea to make this configurable. How can
people sensibly choose between the two without deep understanding of
internals?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme
  2020-01-22 11:59     ` Michal Hocko
@ 2020-01-22 15:12       ` Jens Axboe
  2020-01-22 16:54         ` Jerome Glisse
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2020-01-22 15:12 UTC (permalink / raw)
  To: Michal Hocko, Jerome Glisse
  Cc: lsf-pc, linux-fsdevel, linux-mm, Benjamin LaHaise

On 1/22/20 4:59 AM, Michal Hocko wrote:
> On Tue 21-01-20 20:57:23, Jerome Glisse wrote:
>> We can also discuss what kind of knobs we want to expose so that
>> people can decide to choose the tradeof themself (ie from i want low
>> latency io-uring and i don't care wether mm can not do its business; to
>> i want mm to never be impeded in its business and i accept the extra
>> latency burst i might face in io operations).
> 
> I do not think it is a good idea to make this configurable. How can
> people sensibly choose between the two without deep understanding of
> internals?

Fully agree, we can't just punt this to a knob and call it good, that's
a typical fallacy of core changes. And there is only one mode for
io_uring, and that's consistent low latency. If this change introduces
weird reclaim, compaction or migration latencies, then that's a
non-starter as far as I'm concerned.

And what do those two settings even mean? I don't even know, and a user
sure as hell doesn't either.

io_uring pins two types of pages: registered buffers, which are used
for actual IO, and the rings themselves. The rings are not used for IO,
just to communicate between the application and the kernel.
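
For context, registering buffers looks roughly like this from
userspace with liburing (illustrative sketch; error handling and
unregistration trimmed):

#include <liburing.h>
#include <sys/uio.h>
#include <errno.h>
#include <stdlib.h>

int setup_fixed_buffer(struct io_uring *ring, void **buf, size_t len)
{
	struct iovec iov;

	if (posix_memalign(buf, 4096, len))
		return -ENOMEM;

	iov.iov_base = *buf;
	iov.iov_len = len;

	/*
	 * The kernel pins the pages backing iov here; they stay pinned
	 * until the buffers are unregistered or the ring goes away.
	 * IORING_OP_READ_FIXED/WRITE_FIXED sqes then refer to the
	 * buffer by index, with no per-io page lookup.
	 */
	return io_uring_register_buffers(ring, &iov, 1);
}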

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme
  2020-01-22  5:00   ` Jerome Glisse
@ 2020-01-22 15:56     ` Dan Williams
  2020-01-22 17:02       ` Jerome Glisse
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Williams @ 2020-01-22 15:56 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: linux-fsdevel, linux-mm, lsf-pc, Jens Axboe, Benjamin LaHaise

On Tue, Jan 21, 2020 at 9:04 PM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Tue, Jan 21, 2020 at 08:19:54PM -0800, Dan Williams wrote:
> > On Tue, Jan 21, 2020 at 6:34 PM <jglisse@redhat.com> wrote:
> > >
> > > From: Jérôme Glisse <jglisse@redhat.com>
> > >
> > > Direct I/O does pin memory through GUP (get user page) this does
> > > block several mm activities like:
> > >     - compaction
> > >     - numa
> > >     - migration
> > >     ...
> > >
> > > It is also troublesome if the pinned pages are actualy file back
> > > pages that migth go under writeback. In which case the page can
> > > not be write protected from direct-io point of view (see various
> > > discussion about recent work on GUP [1]). This does happens for
> > > instance if the virtual memory address use as buffer for read
> > > operation is the outcome of an mmap of a regular file.
> > >
> > >
> > > With direct-io or aio (asynchronous io) pages are pinned until
> > > syscall completion (which depends on many factors: io size,
> > > block device speed, ...). For io-uring pages can be pinned an
> > > indifinite amount of time.
> > >
> > >
> > > So i would like to convert direct io code (direct-io, aio and
> > > io-uring) to obey mmu notifier and thus allow memory management
> > > and writeback to work and behave like any other process memory.
> > >
> > > For direct-io and aio this mostly gives a way to wait on syscall
> > > completion. For io-uring this means that buffer might need to be
> > > re-validated (ie looking up pages again to get the new set of
> > > pages for the buffer). Impact for io-uring is the delay needed
> > > to lookup new pages or wait on writeback (if necessary). This
> > > would only happens _if_ an invalidation event happens, which it-
> > > self should only happen under memory preissure or for NUMA
> > > activities.
> >
> > This seems to assume that memory pressure and NUMA migration are rare
> > events. Some of the proposed hierarchical memory management schemes
> > [1] might impact that assumption.
> >
> > [1]: http://lore.kernel.org/r/20191101075727.26683-1-ying.huang@intel.com/
> >
>
> Yes, it is true that it will likely becomes more and more an issues.
> We are facing a tough choice here as pining block NUMA or any kind of
> migration and thus might impede performance while invalidating an io-
> uring buffer will also cause a small latency burst. I do not think we
> can make everyone happy but at very least we should avoid pining and
> provide knobs to let user decide what they care more about (ie io with-
> out burst or better NUMA locality).

It's a question of tradeoffs, and this proposal seems to have already
decided that the question should be answered in favor of a GPU/SVM-
centric view of the world without presenting the alternative.
Direct-I/O colliding with GPU operations might also be solved by
always triggering a migration, and applications that care would avoid
colliding operations that slow down their GPU workload. A slow compat
fallback that applications can programmatically avoid is more flexible
than an upfront knob.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme
  2020-01-22 15:12       ` Jens Axboe
@ 2020-01-22 16:54         ` Jerome Glisse
  2020-01-22 17:04           ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Jerome Glisse @ 2020-01-22 16:54 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Michal Hocko, lsf-pc, linux-fsdevel, linux-mm, Benjamin LaHaise

On Wed, Jan 22, 2020 at 08:12:51AM -0700, Jens Axboe wrote:
> On 1/22/20 4:59 AM, Michal Hocko wrote:
> > On Tue 21-01-20 20:57:23, Jerome Glisse wrote:
> >> We can also discuss what kind of knobs we want to expose so that
> >> people can decide to choose the tradeof themself (ie from i want low
> >> latency io-uring and i don't care wether mm can not do its business; to
> >> i want mm to never be impeded in its business and i accept the extra
> >> latency burst i might face in io operations).
> > 
> > I do not think it is a good idea to make this configurable. How can
> > people sensibly choose between the two without deep understanding of
> > internals?
> 
> Fully agree, we can't just punt this to a knob and call it good, that's
> a typical fallacy of core changes. And there is only one mode for
> io_uring, and that's consistent low latency. If this change introduces
> weird reclaim, compaction or migration latencies, then that's a
> non-starter as far as I'm concerned.
> 
> And what do those two settings even mean? I don't even know, and a user
> sure as hell doesn't either.
> 
> io_uring pins two types of pages - registered buffers, these are used
> for actual IO, and the rings themselves. The rings are not used for IO,
> just used to communicate between the application and the kernel.

So, do we still want to solve the writeback problem for file-backed
pages if the pages in the user buffer come from a file?

Also, we can introduce a flag when registering a buffer that allows
registering it without pinning and thus avoids the RLIMIT_MEMLOCK
accounting, at the cost of possible latency spikes. Then the user
registering the buffer knows what they get.

Maybe it would be good to test; it might stay in the noise, in which
case it might be a good thing to do. Also there are strategies to
avoid latency spikes, for instance we can block or force-skip mm
invalidation if the buffer has pending/running io in the ring, ie
only let buffer invalidation happen when there is no pending/running
submission entry.

We can also pick what kind of invalidation we allow (compaction,
migration, ...) and thus limit the scope and likelihood of
invalidation.
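
As a sketch of the block/force-skip idea above (a variant of the
invalidate callback sketched in my first mail; the inflight counter
and wait queue are made-up fields on the buffer object):

static bool uring_ubuf_invalidate(struct mmu_interval_notifier *mni,
				  const struct mmu_notifier_range *range,
				  unsigned long cur_seq)
{
	struct uring_ubuf *ubuf = container_of(mni, struct uring_ubuf, mni);

	/* Non-blockable contexts (eg some reclaim paths) just back off. */
	if (!mmu_notifier_range_blockable(range))
		return false;

	/* Otherwise wait until no submission is using the buffer. */
	wait_event(ubuf->inflight_wq, atomic_read(&ubuf->inflight) == 0);

	mmu_interval_set_seq(mni, cur_seq);
	WRITE_ONCE(ubuf->stale, true);
	return true;
}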

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme
  2020-01-22 15:56     ` [Lsf-pc] " Dan Williams
@ 2020-01-22 17:02       ` Jerome Glisse
  0 siblings, 0 replies; 16+ messages in thread
From: Jerome Glisse @ 2020-01-22 17:02 UTC (permalink / raw)
  To: Dan Williams
  Cc: linux-fsdevel, linux-mm, lsf-pc, Jens Axboe, Benjamin LaHaise

On Wed, Jan 22, 2020 at 07:56:50AM -0800, Dan Williams wrote:
> On Tue, Jan 21, 2020 at 9:04 PM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Tue, Jan 21, 2020 at 08:19:54PM -0800, Dan Williams wrote:
> > > On Tue, Jan 21, 2020 at 6:34 PM <jglisse@redhat.com> wrote:
> > > >
> > > > From: Jérôme Glisse <jglisse@redhat.com>
> > > >
> > > > Direct I/O does pin memory through GUP (get user page) this does
> > > > block several mm activities like:
> > > >     - compaction
> > > >     - numa
> > > >     - migration
> > > >     ...
> > > >
> > > > It is also troublesome if the pinned pages are actualy file back
> > > > pages that migth go under writeback. In which case the page can
> > > > not be write protected from direct-io point of view (see various
> > > > discussion about recent work on GUP [1]). This does happens for
> > > > instance if the virtual memory address use as buffer for read
> > > > operation is the outcome of an mmap of a regular file.
> > > >
> > > >
> > > > With direct-io or aio (asynchronous io) pages are pinned until
> > > > syscall completion (which depends on many factors: io size,
> > > > block device speed, ...). For io-uring pages can be pinned an
> > > > indifinite amount of time.
> > > >
> > > >
> > > > So i would like to convert direct io code (direct-io, aio and
> > > > io-uring) to obey mmu notifier and thus allow memory management
> > > > and writeback to work and behave like any other process memory.
> > > >
> > > > For direct-io and aio this mostly gives a way to wait on syscall
> > > > completion. For io-uring this means that buffer might need to be
> > > > re-validated (ie looking up pages again to get the new set of
> > > > pages for the buffer). Impact for io-uring is the delay needed
> > > > to lookup new pages or wait on writeback (if necessary). This
> > > > would only happens _if_ an invalidation event happens, which it-
> > > > self should only happen under memory preissure or for NUMA
> > > > activities.
> > >
> > > This seems to assume that memory pressure and NUMA migration are rare
> > > events. Some of the proposed hierarchical memory management schemes
> > > [1] might impact that assumption.
> > >
> > > [1]: http://lore.kernel.org/r/20191101075727.26683-1-ying.huang@intel.com/
> > >
> >
> > Yes, it is true that it will likely becomes more and more an issues.
> > We are facing a tough choice here as pining block NUMA or any kind of
> > migration and thus might impede performance while invalidating an io-
> > uring buffer will also cause a small latency burst. I do not think we
> > can make everyone happy but at very least we should avoid pining and
> > provide knobs to let user decide what they care more about (ie io with-
> > out burst or better NUMA locality).
> 
> It's a question of tradeoffs and this proposal seems to have already
> decided that the question should be answered in favor a GPU/SVM
> centric view of the world without presenting the alternative.
> Direct-I/O colliding with GPU operations might also be solved by
> always triggering a migration, and applications that care would avoid
> colliding operations that slow down their GPU workload. A slow compat
> fallback that applications can programmatically avoid is more flexible
> than an upfront knob.

To make it clear, I do not care about direct I/O colliding with anything
GPU or otherwise; anything like that is up to the application programmer.

My sole interest is page pinning that blocks compaction and migration.
The former impedes the kernel's capability to materialize huge pages;
the latter can impact performance badly, including for the direct i/o
user. For instance, if the process using io-uring gets migrated to a
different node after registering its buffer, then it will keep using
memory from a different node, which in the end might be much worse than
the one-time extra latency spike the migration incurs.

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme
  2020-01-22 16:54         ` Jerome Glisse
@ 2020-01-22 17:04           ` Jens Axboe
  2020-01-22 17:28             ` Jerome Glisse
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2020-01-22 17:04 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Michal Hocko, lsf-pc, linux-fsdevel, linux-mm, Benjamin LaHaise

On 1/22/20 9:54 AM, Jerome Glisse wrote:
> On Wed, Jan 22, 2020 at 08:12:51AM -0700, Jens Axboe wrote:
>> On 1/22/20 4:59 AM, Michal Hocko wrote:
>>> On Tue 21-01-20 20:57:23, Jerome Glisse wrote:
>>>> We can also discuss what kind of knobs we want to expose so that
>>>> people can decide to choose the tradeof themself (ie from i want low
>>>> latency io-uring and i don't care wether mm can not do its business; to
>>>> i want mm to never be impeded in its business and i accept the extra
>>>> latency burst i might face in io operations).
>>>
>>> I do not think it is a good idea to make this configurable. How can
>>> people sensibly choose between the two without deep understanding of
>>> internals?
>>
>> Fully agree, we can't just punt this to a knob and call it good, that's
>> a typical fallacy of core changes. And there is only one mode for
>> io_uring, and that's consistent low latency. If this change introduces
>> weird reclaim, compaction or migration latencies, then that's a
>> non-starter as far as I'm concerned.
>>
>> And what do those two settings even mean? I don't even know, and a user
>> sure as hell doesn't either.
>>
>> io_uring pins two types of pages - registered buffers, these are used
>> for actual IO, and the rings themselves. The rings are not used for IO,
>> just used to communicate between the application and the kernel.
> 
> So, do we still want to solve file back pages write back if page in
> ubuffer are from a file ?

That's not currently a concern for io_uring, as it disallows file-backed
pages for the IO buffers that are being registered.

> Also we can introduce a flag when registering buffer that allows to
> register buffer without pining and thus avoid the RLIMIT_MEMLOCK at
> the cost of possible latency spike. Then user registering the buffer
> knows what he gets.

That may be fine for other users, but I don't think it'll apply
to io_uring. I can't see anyone selecting that flag, unless you're
doing something funky where you're registering a substantial amount
of the system memory for IO buffers. And I don't think that's going
to be a super valid use case...

> Maybe it would be good to test, it might stay in the noise, then it
> might be a good thing to do. Also they are strategy to avoid latency
> spike for instance we can block/force skip mm invalidation if buffer
> has pending/running io in the ring ie only have buffer invalidation
> happens when there is no pending/running submission entry.

Would that really work? The buffer could very well be idle right when
you check, but wanting to do IO the instant you decide you can do
background work on it. Additionally, that would require accounting
on when the buffers are inflight, which is exactly the kind of
overhead we're trying to avoid to begin with.

> We can also pick what kind of invalidation we allow (compaction,
> migration, ...) and thus limit the scope and likelyhood of
> invalidation.

I think it'd be useful to try and understand the use case first.
If we're pinning a small percentage of the system memory, do we
really care at all? Isn't it completely fine to just ignore?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme
  2020-01-22 17:04           ` Jens Axboe
@ 2020-01-22 17:28             ` Jerome Glisse
  2020-01-22 17:38               ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Jerome Glisse @ 2020-01-22 17:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Michal Hocko, lsf-pc, linux-fsdevel, linux-mm, Benjamin LaHaise

On Wed, Jan 22, 2020 at 10:04:44AM -0700, Jens Axboe wrote:
> On 1/22/20 9:54 AM, Jerome Glisse wrote:
> > On Wed, Jan 22, 2020 at 08:12:51AM -0700, Jens Axboe wrote:
> >> On 1/22/20 4:59 AM, Michal Hocko wrote:
> >>> On Tue 21-01-20 20:57:23, Jerome Glisse wrote:
> >>>> We can also discuss what kind of knobs we want to expose so that
> >>>> people can decide to choose the tradeof themself (ie from i want low
> >>>> latency io-uring and i don't care wether mm can not do its business; to
> >>>> i want mm to never be impeded in its business and i accept the extra
> >>>> latency burst i might face in io operations).
> >>>
> >>> I do not think it is a good idea to make this configurable. How can
> >>> people sensibly choose between the two without deep understanding of
> >>> internals?
> >>
> >> Fully agree, we can't just punt this to a knob and call it good, that's
> >> a typical fallacy of core changes. And there is only one mode for
> >> io_uring, and that's consistent low latency. If this change introduces
> >> weird reclaim, compaction or migration latencies, then that's a
> >> non-starter as far as I'm concerned.
> >>
> >> And what do those two settings even mean? I don't even know, and a user
> >> sure as hell doesn't either.
> >>
> >> io_uring pins two types of pages - registered buffers, these are used
> >> for actual IO, and the rings themselves. The rings are not used for IO,
> >> just used to communicate between the application and the kernel.
> > 
> > So, do we still want to solve file back pages write back if page in
> > ubuffer are from a file ?
> 
> That's not currently a concern for io_uring, as it disallows file backed
> pages for the IO buffers that are being registered.
> 
> > Also we can introduce a flag when registering buffer that allows to
> > register buffer without pining and thus avoid the RLIMIT_MEMLOCK at
> > the cost of possible latency spike. Then user registering the buffer
> > knows what he gets.
> 
> That may be fine for others users, but I don't think it'll apply
> to io_uring. I can't see anyone selecting that flag, unless you're
> doing something funky where you're registering a substantial amount
> of the system memory for IO buffers. And I don't think that's going
> to be a super valid use case...

Given that datasets are getting bigger and bigger, I would assume that
we will have people who want to use io-uring with large buffers.

> 
> > Maybe it would be good to test, it might stay in the noise, then it
> > might be a good thing to do. Also they are strategy to avoid latency
> > spike for instance we can block/force skip mm invalidation if buffer
> > has pending/running io in the ring ie only have buffer invalidation
> > happens when there is no pending/running submission entry.
> 
> Would that really work? The buffer could very well be idle right when
> you check, but wanting to do IO the instant you decide you can do
> background work on it. Additionally, that would require accounting
> on when the buffers are inflight, which is exactly the kind of
> overhead we're trying to avoid to begin with.
> 
> > We can also pick what kind of invalidation we allow (compaction,
> > migration, ...) and thus limit the scope and likelyhood of
> > invalidation.
> 
> I think it'd be useful to try and understand the use case first.
> If we're pinning a small percentage of the system memory, do we
> really care at all? Isn't it completely fine to just ignore?

My main motivation is migration on NUMA systems: if the process that
registered a buffer gets migrated to a different node, then it might
actually end up with bad performance because its io buffers are still
on the old node. I am not sure we want to tell application developers
to constantly monitor which node they are on and to re-register buffers
after process migration to allow for memory migration.
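
(To illustrate the kind of dance that would otherwise be needed, a
userspace sketch using move_pages(2) with a NULL nodes array to query
placement; buffer_on_local_node() is a made-up helper, and clearly not
something every io-uring user should have to write:)

#define _GNU_SOURCE
#include <numaif.h>
#include <sched.h>

/* Returns 1 if the first page of buf is on the node we run on. */
static int buffer_on_local_node(void *buf)
{
	void *pages[1] = { buf };
	int status[1];
	unsigned int cpu, node;

	if (getcpu(&cpu, &node))
		return -1;
	/* nodes == NULL means "just report the node of each page". */
	if (move_pages(0, 1, pages, NULL, status, 0))
		return -1;
	return status[0] == (int)node;
}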

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme
  2020-01-22 17:28             ` Jerome Glisse
@ 2020-01-22 17:38               ` Jens Axboe
  2020-01-22 17:40                 ` Jerome Glisse
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2020-01-22 17:38 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Michal Hocko, lsf-pc, linux-fsdevel, linux-mm, Benjamin LaHaise

On 1/22/20 10:28 AM, Jerome Glisse wrote:
> On Wed, Jan 22, 2020 at 10:04:44AM -0700, Jens Axboe wrote:
>> On 1/22/20 9:54 AM, Jerome Glisse wrote:
>>> On Wed, Jan 22, 2020 at 08:12:51AM -0700, Jens Axboe wrote:
>>>> On 1/22/20 4:59 AM, Michal Hocko wrote:
>>>>> On Tue 21-01-20 20:57:23, Jerome Glisse wrote:
>>>>>> We can also discuss what kind of knobs we want to expose so that
>>>>>> people can decide to choose the tradeof themself (ie from i want low
>>>>>> latency io-uring and i don't care wether mm can not do its business; to
>>>>>> i want mm to never be impeded in its business and i accept the extra
>>>>>> latency burst i might face in io operations).
>>>>>
>>>>> I do not think it is a good idea to make this configurable. How can
>>>>> people sensibly choose between the two without deep understanding of
>>>>> internals?
>>>>
>>>> Fully agree, we can't just punt this to a knob and call it good, that's
>>>> a typical fallacy of core changes. And there is only one mode for
>>>> io_uring, and that's consistent low latency. If this change introduces
>>>> weird reclaim, compaction or migration latencies, then that's a
>>>> non-starter as far as I'm concerned.
>>>>
>>>> And what do those two settings even mean? I don't even know, and a user
>>>> sure as hell doesn't either.
>>>>
>>>> io_uring pins two types of pages - registered buffers, these are used
>>>> for actual IO, and the rings themselves. The rings are not used for IO,
>>>> just used to communicate between the application and the kernel.
>>>
>>> So, do we still want to solve file back pages write back if page in
>>> ubuffer are from a file ?
>>
>> That's not currently a concern for io_uring, as it disallows file backed
>> pages for the IO buffers that are being registered.
>>
>>> Also we can introduce a flag when registering buffer that allows to
>>> register buffer without pining and thus avoid the RLIMIT_MEMLOCK at
>>> the cost of possible latency spike. Then user registering the buffer
>>> knows what he gets.
>>
>> That may be fine for others users, but I don't think it'll apply
>> to io_uring. I can't see anyone selecting that flag, unless you're
>> doing something funky where you're registering a substantial amount
>> of the system memory for IO buffers. And I don't think that's going
>> to be a super valid use case...
> 
> Given dataset are getting bigger and bigger i would assume that we
> will have people who want to use io-uring with large buffer.
> 
>>
>>> Maybe it would be good to test, it might stay in the noise, then it
>>> might be a good thing to do. Also they are strategy to avoid latency
>>> spike for instance we can block/force skip mm invalidation if buffer
>>> has pending/running io in the ring ie only have buffer invalidation
>>> happens when there is no pending/running submission entry.
>>
>> Would that really work? The buffer could very well be idle right when
>> you check, but wanting to do IO the instant you decide you can do
>> background work on it. Additionally, that would require accounting
>> on when the buffers are inflight, which is exactly the kind of
>> overhead we're trying to avoid to begin with.
>>
>>> We can also pick what kind of invalidation we allow (compaction,
>>> migration, ...) and thus limit the scope and likelyhood of
>>> invalidation.
>>
>> I think it'd be useful to try and understand the use case first.
>> If we're pinning a small percentage of the system memory, do we
>> really care at all? Isn't it completely fine to just ignore?
> 
> My main motivation is migration in NUMA system, if the process that
> did register buffer get migrated to a different node then it might
> actualy end up with bad performance because its io buffer are still
> on hold node. I am not sure we want to tell application developer to
> constantly monitor which node they are on and to re-register buffer
> after process migration to allow for memory migration.

If the process truly cares, would it not have pinned itself to that
node?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme
  2020-01-22 17:38               ` Jens Axboe
@ 2020-01-22 17:40                 ` Jerome Glisse
  2020-01-22 17:49                   ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Jerome Glisse @ 2020-01-22 17:40 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Michal Hocko, lsf-pc, linux-fsdevel, linux-mm, Benjamin LaHaise

On Wed, Jan 22, 2020 at 10:38:56AM -0700, Jens Axboe wrote:
> On 1/22/20 10:28 AM, Jerome Glisse wrote:
> > On Wed, Jan 22, 2020 at 10:04:44AM -0700, Jens Axboe wrote:
> >> On 1/22/20 9:54 AM, Jerome Glisse wrote:
> >>> On Wed, Jan 22, 2020 at 08:12:51AM -0700, Jens Axboe wrote:
> >>>> On 1/22/20 4:59 AM, Michal Hocko wrote:
> >>>>> On Tue 21-01-20 20:57:23, Jerome Glisse wrote:
> >>>>>> We can also discuss what kind of knobs we want to expose so that
> >>>>>> people can decide to choose the tradeof themself (ie from i want low
> >>>>>> latency io-uring and i don't care wether mm can not do its business; to
> >>>>>> i want mm to never be impeded in its business and i accept the extra
> >>>>>> latency burst i might face in io operations).
> >>>>>
> >>>>> I do not think it is a good idea to make this configurable. How can
> >>>>> people sensibly choose between the two without deep understanding of
> >>>>> internals?
> >>>>
> >>>> Fully agree, we can't just punt this to a knob and call it good, that's
> >>>> a typical fallacy of core changes. And there is only one mode for
> >>>> io_uring, and that's consistent low latency. If this change introduces
> >>>> weird reclaim, compaction or migration latencies, then that's a
> >>>> non-starter as far as I'm concerned.
> >>>>
> >>>> And what do those two settings even mean? I don't even know, and a user
> >>>> sure as hell doesn't either.
> >>>>
> >>>> io_uring pins two types of pages - registered buffers, these are used
> >>>> for actual IO, and the rings themselves. The rings are not used for IO,
> >>>> just used to communicate between the application and the kernel.
> >>>
> >>> So, do we still want to solve file back pages write back if page in
> >>> ubuffer are from a file ?
> >>
> >> That's not currently a concern for io_uring, as it disallows file backed
> >> pages for the IO buffers that are being registered.
> >>
> >>> Also we can introduce a flag when registering buffer that allows to
> >>> register buffer without pining and thus avoid the RLIMIT_MEMLOCK at
> >>> the cost of possible latency spike. Then user registering the buffer
> >>> knows what he gets.
> >>
> >> That may be fine for others users, but I don't think it'll apply
> >> to io_uring. I can't see anyone selecting that flag, unless you're
> >> doing something funky where you're registering a substantial amount
> >> of the system memory for IO buffers. And I don't think that's going
> >> to be a super valid use case...
> > 
> > Given dataset are getting bigger and bigger i would assume that we
> > will have people who want to use io-uring with large buffer.
> > 
> >>
> >>> Maybe it would be good to test, it might stay in the noise, then it
> >>> might be a good thing to do. Also they are strategy to avoid latency
> >>> spike for instance we can block/force skip mm invalidation if buffer
> >>> has pending/running io in the ring ie only have buffer invalidation
> >>> happens when there is no pending/running submission entry.
> >>
> >> Would that really work? The buffer could very well be idle right when
> >> you check, but wanting to do IO the instant you decide you can do
> >> background work on it. Additionally, that would require accounting
> >> on when the buffers are inflight, which is exactly the kind of
> >> overhead we're trying to avoid to begin with.
> >>
> >>> We can also pick what kind of invalidation we allow (compaction,
> >>> migration, ...) and thus limit the scope and likelyhood of
> >>> invalidation.
> >>
> >> I think it'd be useful to try and understand the use case first.
> >> If we're pinning a small percentage of the system memory, do we
> >> really care at all? Isn't it completely fine to just ignore?
> > 
> > My main motivation is migration in NUMA system, if the process that
> > did register buffer get migrated to a different node then it might
> > actualy end up with bad performance because its io buffer are still
> > on hold node. I am not sure we want to tell application developer to
> > constantly monitor which node they are on and to re-register buffer
> > after process migration to allow for memory migration.
> 
> If the process truly cares, would it not have pinned itself to that
> node?

Not necessarily; programmers cannot think of everything, and process
pinning also defeats load balancing. Moreover, we now have to think
about deep memory topology, ie by the time you register the buffer the
pages backing it might be from slower memory, and then all your io and
CPU accesses will be stuck using that.

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme
  2020-01-22 17:40                 ` Jerome Glisse
@ 2020-01-22 17:49                   ` Jens Axboe
  0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2020-01-22 17:49 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Michal Hocko, lsf-pc, linux-fsdevel, linux-mm, Benjamin LaHaise

On 1/22/20 10:40 AM, Jerome Glisse wrote:
> On Wed, Jan 22, 2020 at 10:38:56AM -0700, Jens Axboe wrote:
>> On 1/22/20 10:28 AM, Jerome Glisse wrote:
>>> On Wed, Jan 22, 2020 at 10:04:44AM -0700, Jens Axboe wrote:
>>>> On 1/22/20 9:54 AM, Jerome Glisse wrote:
>>>>> On Wed, Jan 22, 2020 at 08:12:51AM -0700, Jens Axboe wrote:
>>>>>> On 1/22/20 4:59 AM, Michal Hocko wrote:
>>>>>>> On Tue 21-01-20 20:57:23, Jerome Glisse wrote:
>>>>>>>> We can also discuss what kind of knobs we want to expose so that
>>>>>>>> people can decide to choose the tradeof themself (ie from i want low
>>>>>>>> latency io-uring and i don't care wether mm can not do its business; to
>>>>>>>> i want mm to never be impeded in its business and i accept the extra
>>>>>>>> latency burst i might face in io operations).
>>>>>>>
>>>>>>> I do not think it is a good idea to make this configurable. How can
>>>>>>> people sensibly choose between the two without deep understanding of
>>>>>>> internals?
>>>>>>
>>>>>> Fully agree, we can't just punt this to a knob and call it good, that's
>>>>>> a typical fallacy of core changes. And there is only one mode for
>>>>>> io_uring, and that's consistent low latency. If this change introduces
>>>>>> weird reclaim, compaction or migration latencies, then that's a
>>>>>> non-starter as far as I'm concerned.
>>>>>>
>>>>>> And what do those two settings even mean? I don't even know, and a user
>>>>>> sure as hell doesn't either.
>>>>>>
>>>>>> io_uring pins two types of pages - registered buffers, these are used
>>>>>> for actual IO, and the rings themselves. The rings are not used for IO,
>>>>>> just used to communicate between the application and the kernel.
>>>>>
>>>>> So, do we still want to solve file back pages write back if page in
>>>>> ubuffer are from a file ?
>>>>
>>>> That's not currently a concern for io_uring, as it disallows file backed
>>>> pages for the IO buffers that are being registered.
>>>>
>>>>> Also we can introduce a flag when registering buffer that allows to
>>>>> register buffer without pining and thus avoid the RLIMIT_MEMLOCK at
>>>>> the cost of possible latency spike. Then user registering the buffer
>>>>> knows what he gets.
>>>>
>>>> That may be fine for others users, but I don't think it'll apply
>>>> to io_uring. I can't see anyone selecting that flag, unless you're
>>>> doing something funky where you're registering a substantial amount
>>>> of the system memory for IO buffers. And I don't think that's going
>>>> to be a super valid use case...
>>>
>>> Given dataset are getting bigger and bigger i would assume that we
>>> will have people who want to use io-uring with large buffer.
>>>
>>>>
>>>>> Maybe it would be good to test, it might stay in the noise, then it
>>>>> might be a good thing to do. Also they are strategy to avoid latency
>>>>> spike for instance we can block/force skip mm invalidation if buffer
>>>>> has pending/running io in the ring ie only have buffer invalidation
>>>>> happens when there is no pending/running submission entry.
>>>>
>>>> Would that really work? The buffer could very well be idle right when
>>>> you check, but wanting to do IO the instant you decide you can do
>>>> background work on it. Additionally, that would require accounting
>>>> on when the buffers are inflight, which is exactly the kind of
>>>> overhead we're trying to avoid to begin with.
>>>>
>>>>> We can also pick what kind of invalidation we allow (compaction,
>>>>> migration, ...) and thus limit the scope and likelyhood of
>>>>> invalidation.
>>>>
>>>> I think it'd be useful to try and understand the use case first.
>>>> If we're pinning a small percentage of the system memory, do we
>>>> really care at all? Isn't it completely fine to just ignore?
>>>
>>> My main motivation is migration in NUMA system, if the process that
>>> did register buffer get migrated to a different node then it might
>>> actualy end up with bad performance because its io buffer are still
>>> on hold node. I am not sure we want to tell application developer to
>>> constantly monitor which node they are on and to re-register buffer
>>> after process migration to allow for memory migration.
>>
>> If the process truly cares, would it not have pinned itself to that
>> node?
> 
> Not necesarily, programmer can not thing of everything and also process

Node placement is generally the _first_ thing you think of, though. It's
not like it's some esoteric thing that application developers don't know
anything about. Particularly if you're doing intensive IO, which you
probably are if you register buffers for use with io_uring. That ties to
a hardware device of some sort, or multiple ones. You would have placed
your memory local to that device as well.

> pinning defeat load balancing. Moreover we now have to thing about deep
> memory topology ie by the time you register the buffer the page backing
> it might be from slower memory and then all your io and CPU access will
> be stuck on using that.

To me, this sounds like some sort of event the application will want to
know about. And take appropriate measures.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme
  2020-01-22  3:54 ` Jens Axboe
  2020-01-22  4:57   ` Jerome Glisse
@ 2020-01-27 19:01   ` Jason Gunthorpe
  1 sibling, 0 replies; 16+ messages in thread
From: Jason Gunthorpe @ 2020-01-27 19:01 UTC (permalink / raw)
  To: Jens Axboe; +Cc: jglisse, lsf-pc, linux-fsdevel, linux-mm, Benjamin LaHaise

On Tue, Jan 21, 2020 at 08:54:22PM -0700, Jens Axboe wrote:
> On 1/21/20 7:31 PM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>
> > 
> > Direct I/O does pin memory through GUP (get user page) this does
> > block several mm activities like:
> >     - compaction
> >     - numa
> >     - migration
> >     ...
> > 
> > It is also troublesome if the pinned pages are actualy file back
> > pages that migth go under writeback. In which case the page can
> > not be write protected from direct-io point of view (see various
> > discussion about recent work on GUP [1]). This does happens for
> > instance if the virtual memory address use as buffer for read
> > operation is the outcome of an mmap of a regular file.
> > 
> > 
> > With direct-io or aio (asynchronous io) pages are pinned until
> > syscall completion (which depends on many factors: io size,
> > block device speed, ...). For io-uring pages can be pinned an
> > indifinite amount of time.
> > 
> > 
> > So i would like to convert direct io code (direct-io, aio and
> > io-uring) to obey mmu notifier and thus allow memory management
> > and writeback to work and behave like any other process memory.
> > 
> > For direct-io and aio this mostly gives a way to wait on syscall
> > completion. For io-uring this means that buffer might need to be
> > re-validated (ie looking up pages again to get the new set of
> > pages for the buffer). Impact for io-uring is the delay needed
> > to lookup new pages or wait on writeback (if necessary). This
> > would only happens _if_ an invalidation event happens, which it-
> > self should only happen under memory preissure or for NUMA
> > activities.
> > 
> > They are ways to minimize the impact (for instance by using the
> > mmu notifier type to ignore some invalidation cases).
> > 
> > 
> > So i would like to discuss all this during LSF, it is mostly a
> > filesystem discussion with strong tie to mm.
> 
> I'd be interested in this topic, as it pertains to io_uring. The whole
> point of registered buffers is to avoid mapping overhead, and page
> references. 

I'd also be interested, as it pertains to mmu notifiers and related
work, which I've been involved in reworking lately. I feel others are
looking at doing different things with bios/skbs that are kind of
related to this idea, so I feel it is a worthwhile topic.

This proposal sounds, at a high level, quite similar to what vhost is
doing today, where they want to use copy_to_user without paying its
cost by directly accessing kernel pages and keeping everything in sync
with notifiers.
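
For reference, the consumer side of such a notifier-synchronized scheme
is typically the mmu_interval_notifier sequence/retry pattern, roughly
like this (sketch; the buffer object, its lock and the lookup helper
are placeholders):

static int ubuf_revalidate(struct uring_ubuf *ubuf)
{
	unsigned long seq;
	int ret;

again:
	seq = mmu_interval_read_begin(&ubuf->mni);

	/* Look the pages up again; may fault, so no locks held here. */
	ret = lookup_ubuf_pages(ubuf);
	if (ret)
		return ret;

	spin_lock(&ubuf->lock);
	if (mmu_interval_read_retry(&ubuf->mni, seq)) {
		/*
		 * Raced with an invalidation; the invalidate side must
		 * take ubuf->lock around mmu_interval_set_seq() for
		 * this check to be race free.
		 */
		spin_unlock(&ubuf->lock);
		goto again;
	}
	WRITE_ONCE(ubuf->stale, false);
	spin_unlock(&ubuf->lock);
	return 0;
}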

> If we add extra overhead per operation for that, well... I'm
> assuming the above is strictly for file mapped pages? Or also page
> migration?

Generally the performance profile we see in other places is that
applications that don't touch their memory have no impact, while
things get wonky during invalidations.

However, that has assumed DMA devices, where the device has some
optimized HW way to manage the locking.

In vhost the performance concerns seem to revolve around locking the
CPU access thread against the mmu notifier thread.

I'm curious about Jérôme's thinking on this, particularly when you mix
in longer lifetimes of skbs and bios and whatnot. At some point the
pages must become pinned, for instance while they are submitted to a
device for DMA.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-01-27 19:01 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-22  2:31 [LSF/MM/BPF TOPIC] Do not pin pages for various direct-io scheme jglisse
2020-01-22  3:54 ` Jens Axboe
2020-01-22  4:57   ` Jerome Glisse
2020-01-22 11:59     ` Michal Hocko
2020-01-22 15:12       ` Jens Axboe
2020-01-22 16:54         ` Jerome Glisse
2020-01-22 17:04           ` Jens Axboe
2020-01-22 17:28             ` Jerome Glisse
2020-01-22 17:38               ` Jens Axboe
2020-01-22 17:40                 ` Jerome Glisse
2020-01-22 17:49                   ` Jens Axboe
2020-01-27 19:01   ` Jason Gunthorpe
2020-01-22  4:19 ` Dan Williams
2020-01-22  5:00   ` Jerome Glisse
2020-01-22 15:56     ` [Lsf-pc] " Dan Williams
2020-01-22 17:02       ` Jerome Glisse
