From: Jan Kara <jack@suse.cz>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.cz>, Jan Kara <jack@suse.cz>,
	linux-nvdimm <linux-nvdimm@lists.01.org>,
	Christoph Hellwig <hch@infradead.org>,
	Linux MM <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps
Date: Thu, 18 Oct 2018 19:43:00 +0200
Message-ID: <20181018174300.GT23493@quack2.suse.cz>
In-Reply-To: <CAPcyv4jt_w-89+m4w=FcN0oF3axiGqPBTHfEcWwdhnr12_=17Q@mail.gmail.com>

On Wed 17-10-18 13:01:15, Dan Williams wrote:
> On Sun, Oct 14, 2018 at 8:47 AM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Fri, Oct 5, 2018 at 6:17 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > On Thu, Oct 4, 2018 at 11:35 PM Johannes Thumshirn <jthumshirn@suse.de> wrote:
> > > >
> > > > On Thu, Oct 04, 2018 at 11:25:24PM -0700, Christoph Hellwig wrote:
> > > > > Since when is an article on some website a promise (of what exactly)
> > > > > by linux kernel developers?
> > > >
> > > > Let's stop it here, this doesn't make any sort of forward progress.
> > > >
> > >
> > > I do think there is some progress we can make if we separate DAX as an
> > > access mechanism vs DAX as a resource utilization contract. My attempt
> > > at representing Christoph's position is that the kernel should not be
> > > advertising / making access mechanism guarantees. That makes sense.
> > > Even with MAP_SYNC+DAX the kernel reserves the right to write-protect
> > > mappings at will and trap access into a kernel handler. Additionally,
> > > whether read(2) / write(2) does anything different behind the scenes
> > > in DAX mode or not should be irrelevant to the application.
> > >
> > > That said, what is certainly not irrelevant is a kernel giving
> > > userspace visibility and control into resource utilization. Jan's
> > > MADV_DIRECT_ACCESS lets the application make assumptions about page
> > > cache utilization; we just need another mechanism to read if a
> > > mapping is effectively already in that state.
> >
> > I thought more about this today while reviewing the virtio-pmem driver
> > that will behave mostly like a DAX-capable pmem device except it will
> > be implemented by passing host page cache through to the guest as a
> > pmem device with a paravirtualized / asynchronous flush interface.
> > MAP_SYNC obviously needs to be disabled for this case, but we still need
> > to allow some semblance of DAX operation to save allocating page cache
> > in the guest. The need to explicitly clarify the state of DAX is
> > growing with the different nuances of DAX operation.
> >
> > Let's use a new MAP_DIRECT flag to positively assert that a given
> > mmap() call is setting up a memory mapping without page-cache or
> > buffered indirection. To be clear, this is not my original MAP_DIRECT
> > proposal from a while back, just a flag to mmap() that causes the
> > mapping attempt to fail if there is any software buffering fronting
> > the memory mapping, or any requirement for software to manage flushing
> > outside of pushing writes through the cpu cache. This way, if we ever
> > extend MAP_SYNC for a buffered use case we can still definitely assert
> > that the mapping is "direct". So, MAP_DIRECT would fail for
> > traditional non-DAX block devices, and for this new virtio-pmem case.
> > It would also fail for any pmem device where we cannot assert that the
> > platform will take care of flushing write-pending-queues on power-loss
> > events.
> 
> After letting this sit for a few days I think I'm back to liking
> MADV_DIRECT_ACCESS more since madvise() is more closely related to the
> page-cache management than mmap. It does not solve the query vs enable
> problem, but it's still a step towards giving applications what they
> want with respect to resource expectations.

Yeah, I don't have a strong opinion wrt mmap flag vs madvise flag.

> Perhaps a new syscall to retrieve the effective advice for a range?
> 
>      int madvice(void *addr, size_t length, int *advice);

After some thought, I'm not 100% sure this is really needed. I know about
apps that want to make sure DRAM is not consumed - for those, an mmap / madvise
flag is fine if it returns an error when the feature cannot be provided.
Most other apps don't care whether DAX is on or off. So this call would be
needed only if someone wanted to behave differently depending on whether
DAX is used or not. And although I can imagine some application like that,
I'm not sure how real that is...
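
To illustrate the "flag returns an error if the feature cannot be provided"
pattern: MAP_DIRECT and MADV_DIRECT_ACCESS are only proposals in this thread,
so the rough sketch below uses the existing MAP_SYNC flag together with
MAP_SHARED_VALIDATE as a stand-in. Note that MAP_SYNC makes a guarantee about
persistence, not about page cache usage, so this shows only the shape of the
interface. The helper name map_sync_or_fail() is made up for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers; these are the generic Linux UAPI values. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/*
 * Hypothetical helper: map the first `len` bytes of `path` with MAP_SYNC,
 * or fail. Returns the mapping on success. On failure returns NULL and
 * stores the errno of the failing call in *err.
 */
static void *map_sync_or_fail(const char *path, size_t len, int *err)
{
	assert(err != NULL);

	int fd = open(path, O_RDWR);
	if (fd < 0) {
		*err = errno;
		return NULL;
	}

	/*
	 * MAP_SHARED_VALIDATE makes the kernel reject flags it cannot honor,
	 * so an unsupported MAP_SYNC request fails instead of being silently
	 * ignored.
	 */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	int saved = errno;

	close(fd);	/* an established mapping stays valid after close() */

	if (p == MAP_FAILED) {
		*err = saved;
		return NULL;
	}
	*err = 0;
	return p;
}
```

An application that must not consume DRAM would treat a NULL return as "the
feature is not available here" and bail out or fall back. It never needs a
separate query call for that.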

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


