All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-xfs <linux-xfs@vger.kernel.org>,
	linux-nvdimm <linux-nvdimm@lists.01.org>,
	Linux API <linux-api@vger.kernel.org>,
	Christoph Hellwig <hch@infradead.org>,
	Linux MM <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-ext4 <linux-ext4@vger.kernel.org>
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps
Date: Wed, 3 Oct 2018 08:13:37 -0700	[thread overview]
Message-ID: <CAPcyv4iJvN6_Cf6tw=5a=Uh99LfMFKU7n8QkGcz1ZaxL0Oi-3w@mail.gmail.com> (raw)
In-Reply-To: <20181003150658.GC24030@quack2.suse.cz>

On Wed, Oct 3, 2018 at 8:07 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 03-10-18 07:38:50, Dan Williams wrote:
> > On Wed, Oct 3, 2018 at 5:51 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 02-10-18 13:18:54, Dan Williams wrote:
> > > > On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Tue 02-10-18 07:52:06, Christoph Hellwig wrote:
> > > > > > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote:
> > > > > > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> > > > > > > > No, it should not.  DAX is an implementation detail thay may change
> > > > > > > > or go away at any time.
> > > > > > >
> > > > > > > Well we had an issue with an application checking for dax, this is how
> > > > > > > we landed here in the first place.
> > > > > >
> > > > > > So what exacty is that "DAX" they are querying about (and no, I'm not
> > > > > > joking, nor being philosophical).
> > > > >
> > > > > I believe the application we are speaking about is mostly concerned about
> > > > > the memory overhead of the page cache. Think of a machine that has ~ 1TB of
> > > > > DRAM, the database running on it is about that size as well and they want
> > > > > database state stored somewhere persistently - which they may want to do by
> > > > > modifying mmaped database files if they do small updates... So they really
> > > > > want to be able to use close to all DRAM for the DB and not leave slack
> > > > > space for the kernel page cache to cache 1TB of database files.
> > > >
> > > > VM_MIXEDMAP was never a reliable indication of DAX because it could be
> > > > set for random other device-drivers that use vm_insert_mixed(). The
> > > > MAP_SYNC flag positively indicates that page cache is disabled for a
> > > > given mapping, although whether that property is due to "dax" or some
> > > > other kernel mechanics is purely an internal detail.
> > > >
> > > > I'm not opposed to faking out VM_MIXEDMAP if this broken check has
> > > > made it into production, but again, it's unreliable.
> > >
> > > So luckily this particular application wasn't widely deployed yet so we
> > > will likely get away with the vendor asking customers to update to a
> > > version not looking into smaps and parsing /proc/mounts instead.
> > >
> > > But I don't find parsing /proc/mounts that beautiful either and I'd prefer
> > > if we had a better interface for applications to query whether they can
> > > avoid page cache for mmaps or not.
> >
> > Yeah, the mount flag is not a good indicator either. I think we need
> > to follow through on the per-inode property of DAX. Darrick and I
> > discussed just allowing the property to be inherited from the parent
> > directory at file creation time. That avoids the dynamic set-up /
> > teardown races that seem intractable at this point.
> >
> > What's wrong with MAP_SYNC as a page-cache detector in the meantime?
>
> So IMHO checking for MAP_SYNC is about as reliable as checking for 'dax'
> mount option. It works now but nobody promises it will reliably detect DAX in
> future - e.g. there's nothing that prevents MAP_SYNC to work for mappings
> using pagecache if we find a sensible usecase for that.

Fair enough.

> WRT per-inode DAX property, AFAIU that inode flag is just going to be
> advisory thing - i.e., use DAX if possible. If you mount a filesystem with
> these inode flags set in a configuration which does not allow DAX to be
> used, you will still be able to access such inodes but the access will use
> page cache instead. And querying these flags should better show real
> on-disk status and not just whether DAX is used as that would result in an
> even bigger mess. So this feature seems to be somewhat orthogonal to the
> API I'm looking for.

True, I imagine once we have that flag we will be able to distinguish
the "saved" property and the "effective / live" property of DAX...
Also it's really not DAX that applications care about as much as "is
there page-cache indirection / overhead for this mapping?". That seems
to be a narrower guarantee that we can make than what "DAX" might
imply.
_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm

WARNING: multiple messages have this Message-ID (diff)
From: Dan Williams <dan.j.williams@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	Dave Jiang <dave.jiang@intel.com>,
	linux-nvdimm <linux-nvdimm@lists.01.org>,
	Linux MM <linux-mm@kvack.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	Linux API <linux-api@vger.kernel.org>
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps
Date: Wed, 3 Oct 2018 08:13:37 -0700	[thread overview]
Message-ID: <CAPcyv4iJvN6_Cf6tw=5a=Uh99LfMFKU7n8QkGcz1ZaxL0Oi-3w@mail.gmail.com> (raw)
In-Reply-To: <20181003150658.GC24030@quack2.suse.cz>

On Wed, Oct 3, 2018 at 8:07 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 03-10-18 07:38:50, Dan Williams wrote:
> > On Wed, Oct 3, 2018 at 5:51 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 02-10-18 13:18:54, Dan Williams wrote:
> > > > On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Tue 02-10-18 07:52:06, Christoph Hellwig wrote:
> > > > > > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote:
> > > > > > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> > > > > > > > No, it should not.  DAX is an implementation detail thay may change
> > > > > > > > or go away at any time.
> > > > > > >
> > > > > > > Well we had an issue with an application checking for dax, this is how
> > > > > > > we landed here in the first place.
> > > > > >
> > > > > > So what exacty is that "DAX" they are querying about (and no, I'm not
> > > > > > joking, nor being philosophical).
> > > > >
> > > > > I believe the application we are speaking about is mostly concerned about
> > > > > the memory overhead of the page cache. Think of a machine that has ~ 1TB of
> > > > > DRAM, the database running on it is about that size as well and they want
> > > > > database state stored somewhere persistently - which they may want to do by
> > > > > modifying mmaped database files if they do small updates... So they really
> > > > > want to be able to use close to all DRAM for the DB and not leave slack
> > > > > space for the kernel page cache to cache 1TB of database files.
> > > >
> > > > VM_MIXEDMAP was never a reliable indication of DAX because it could be
> > > > set for random other device-drivers that use vm_insert_mixed(). The
> > > > MAP_SYNC flag positively indicates that page cache is disabled for a
> > > > given mapping, although whether that property is due to "dax" or some
> > > > other kernel mechanics is purely an internal detail.
> > > >
> > > > I'm not opposed to faking out VM_MIXEDMAP if this broken check has
> > > > made it into production, but again, it's unreliable.
> > >
> > > So luckily this particular application wasn't widely deployed yet so we
> > > will likely get away with the vendor asking customers to update to a
> > > version not looking into smaps and parsing /proc/mounts instead.
> > >
> > > But I don't find parsing /proc/mounts that beautiful either and I'd prefer
> > > if we had a better interface for applications to query whether they can
> > > avoid page cache for mmaps or not.
> >
> > Yeah, the mount flag is not a good indicator either. I think we need
> > to follow through on the per-inode property of DAX. Darrick and I
> > discussed just allowing the property to be inherited from the parent
> > directory at file creation time. That avoids the dynamic set-up /
> > teardown races that seem intractable at this point.
> >
> > What's wrong with MAP_SYNC as a page-cache detector in the meantime?
>
> So IMHO checking for MAP_SYNC is about as reliable as checking for 'dax'
> mount option. It works now but nobody promises it will reliably detect DAX in
> future - e.g. there's nothing that prevents MAP_SYNC to work for mappings
> using pagecache if we find a sensible usecase for that.

Fair enough.

> WRT per-inode DAX property, AFAIU that inode flag is just going to be
> advisory thing - i.e., use DAX if possible. If you mount a filesystem with
> these inode flags set in a configuration which does not allow DAX to be
> used, you will still be able to access such inodes but the access will use
> page cache instead. And querying these flags should better show real
> on-disk status and not just whether DAX is used as that would result in an
> even bigger mess. So this feature seems to be somewhat orthogonal to the
> API I'm looking for.

True, I imagine once we have that flag we will be able to distinguish
the "saved" property and the "effective / live" property of DAX...
Also it's really not DAX that applications care about as much as "is
there page-cache indirection / overhead for this mapping?". That seems
to be a narrower guarantee that we can make than what "DAX" might
imply.

WARNING: multiple messages have this Message-ID (diff)
From: Dan Williams <dan.j.williams-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
To: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Cc: linux-xfs <linux-xfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	linux-nvdimm
	<linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org>,
	Linux API <linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Christoph Hellwig <hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>,
	Linux MM <linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org>,
	linux-fsdevel
	<linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	linux-ext4 <linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps
Date: Wed, 3 Oct 2018 08:13:37 -0700	[thread overview]
Message-ID: <CAPcyv4iJvN6_Cf6tw=5a=Uh99LfMFKU7n8QkGcz1ZaxL0Oi-3w@mail.gmail.com> (raw)
In-Reply-To: <20181003150658.GC24030-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>

On Wed, Oct 3, 2018 at 8:07 AM Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> wrote:
>
> On Wed 03-10-18 07:38:50, Dan Williams wrote:
> > On Wed, Oct 3, 2018 at 5:51 AM Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> wrote:
> > >
> > > On Tue 02-10-18 13:18:54, Dan Williams wrote:
> > > > On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> wrote:
> > > > >
> > > > > On Tue 02-10-18 07:52:06, Christoph Hellwig wrote:
> > > > > > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote:
> > > > > > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> > > > > > > > No, it should not.  DAX is an implementation detail thay may change
> > > > > > > > or go away at any time.
> > > > > > >
> > > > > > > Well we had an issue with an application checking for dax, this is how
> > > > > > > we landed here in the first place.
> > > > > >
> > > > > > So what exacty is that "DAX" they are querying about (and no, I'm not
> > > > > > joking, nor being philosophical).
> > > > >
> > > > > I believe the application we are speaking about is mostly concerned about
> > > > > the memory overhead of the page cache. Think of a machine that has ~ 1TB of
> > > > > DRAM, the database running on it is about that size as well and they want
> > > > > database state stored somewhere persistently - which they may want to do by
> > > > > modifying mmaped database files if they do small updates... So they really
> > > > > want to be able to use close to all DRAM for the DB and not leave slack
> > > > > space for the kernel page cache to cache 1TB of database files.
> > > >
> > > > VM_MIXEDMAP was never a reliable indication of DAX because it could be
> > > > set for random other device-drivers that use vm_insert_mixed(). The
> > > > MAP_SYNC flag positively indicates that page cache is disabled for a
> > > > given mapping, although whether that property is due to "dax" or some
> > > > other kernel mechanics is purely an internal detail.
> > > >
> > > > I'm not opposed to faking out VM_MIXEDMAP if this broken check has
> > > > made it into production, but again, it's unreliable.
> > >
> > > So luckily this particular application wasn't widely deployed yet so we
> > > will likely get away with the vendor asking customers to update to a
> > > version not looking into smaps and parsing /proc/mounts instead.
> > >
> > > But I don't find parsing /proc/mounts that beautiful either and I'd prefer
> > > if we had a better interface for applications to query whether they can
> > > avoid page cache for mmaps or not.
> >
> > Yeah, the mount flag is not a good indicator either. I think we need
> > to follow through on the per-inode property of DAX. Darrick and I
> > discussed just allowing the property to be inherited from the parent
> > directory at file creation time. That avoids the dynamic set-up /
> > teardown races that seem intractable at this point.
> >
> > What's wrong with MAP_SYNC as a page-cache detector in the meantime?
>
> So IMHO checking for MAP_SYNC is about as reliable as checking for 'dax'
> mount option. It works now but nobody promises it will reliably detect DAX in
> future - e.g. there's nothing that prevents MAP_SYNC to work for mappings
> using pagecache if we find a sensible usecase for that.

Fair enough.

> WRT per-inode DAX property, AFAIU that inode flag is just going to be
> advisory thing - i.e., use DAX if possible. If you mount a filesystem with
> these inode flags set in a configuration which does not allow DAX to be
> used, you will still be able to access such inodes but the access will use
> page cache instead. And querying these flags should better show real
> on-disk status and not just whether DAX is used as that would result in an
> even bigger mess. So this feature seems to be somewhat orthogonal to the
> API I'm looking for.

True, I imagine once we have that flag we will be able to distinguish
the "saved" property and the "effective / live" property of DAX...
Also it's really not DAX that applications care about as much as "is
there page-cache indirection / overhead for this mapping?". That seems
to be a narrower guarantee that we can make than what "DAX" might
imply.

  reply	other threads:[~2018-10-03 15:13 UTC|newest]

Thread overview: 124+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-02 10:05 Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps Jan Kara
2018-10-02 10:05 ` Jan Kara
2018-10-02 10:05 ` Jan Kara
2018-10-02 10:50 ` Michal Hocko
2018-10-02 10:50   ` Michal Hocko
2018-10-02 13:32   ` Jan Kara
2018-10-02 13:32     ` Jan Kara
2018-10-02 12:10 ` Johannes Thumshirn
2018-10-02 12:10   ` Johannes Thumshirn
2018-10-02 12:10   ` Johannes Thumshirn
2018-10-02 14:20   ` Johannes Thumshirn
2018-10-02 14:20     ` Johannes Thumshirn
2018-10-02 14:20     ` Johannes Thumshirn
2018-10-02 14:45     ` Christoph Hellwig
2018-10-02 14:45       ` Christoph Hellwig
2018-10-02 15:01       ` Johannes Thumshirn
2018-10-02 15:01         ` Johannes Thumshirn
2018-10-02 15:01         ` Johannes Thumshirn
2018-10-02 15:06         ` Christoph Hellwig
2018-10-02 15:06           ` Christoph Hellwig
2018-10-04 10:09           ` Johannes Thumshirn
2018-10-04 10:09             ` Johannes Thumshirn
2018-10-04 10:09             ` Johannes Thumshirn
2018-10-05  6:25             ` Christoph Hellwig
2018-10-05  6:25               ` Christoph Hellwig
2018-10-05  6:35               ` Johannes Thumshirn
2018-10-05  6:35                 ` Johannes Thumshirn
2018-10-05  6:35                 ` Johannes Thumshirn
2018-10-06  1:17                 ` Dan Williams
2018-10-06  1:17                   ` Dan Williams
2018-10-14 15:47                   ` Dan Williams
2018-10-14 15:47                     ` Dan Williams
2018-10-17 20:01                     ` Dan Williams
2018-10-18 17:43                       ` Jan Kara
2018-10-18 17:43                         ` Jan Kara
2018-10-18 19:10                         ` Dan Williams
2018-10-18 19:10                           ` Dan Williams
2018-10-19  3:01                           ` Dave Chinner
2018-10-19  3:01                             ` Dave Chinner
2018-10-02 14:29   ` Jan Kara
2018-10-02 14:29     ` Jan Kara
2018-10-02 14:29     ` Jan Kara
2018-10-02 14:37     ` Christoph Hellwig
2018-10-02 14:37       ` Christoph Hellwig
2018-10-02 14:37       ` Christoph Hellwig
2018-10-02 14:44       ` Johannes Thumshirn
2018-10-02 14:44         ` Johannes Thumshirn
2018-10-02 14:44         ` Johannes Thumshirn
2018-10-02 14:44         ` Johannes Thumshirn
2018-10-02 14:44         ` Johannes Thumshirn
2018-10-02 14:52         ` Christoph Hellwig
2018-10-02 14:52           ` Christoph Hellwig
2018-10-02 14:52           ` Christoph Hellwig
2018-10-02 15:31           ` Jan Kara
2018-10-02 15:31             ` Jan Kara
2018-10-02 15:31             ` Jan Kara
2018-10-02 20:18             ` Dan Williams
2018-10-02 20:18               ` Dan Williams
2018-10-03 12:50               ` Jan Kara
2018-10-03 12:50                 ` Jan Kara
2018-10-03 12:50                 ` Jan Kara
2018-10-03 14:38                 ` Dan Williams
2018-10-03 14:38                   ` Dan Williams
2018-10-03 15:06                   ` Jan Kara
2018-10-03 15:06                     ` Jan Kara
2018-10-03 15:06                     ` Jan Kara
2018-10-03 15:13                     ` Dan Williams [this message]
2018-10-03 15:13                       ` Dan Williams
2018-10-03 15:13                       ` Dan Williams
2018-10-03 16:44                       ` Jan Kara
2018-10-03 16:44                         ` Jan Kara
2018-10-03 16:44                         ` Jan Kara
2018-10-03 21:13                         ` Dan Williams
2018-10-03 21:13                           ` Dan Williams
2018-10-03 21:13                           ` Dan Williams
2018-10-04 10:04                         ` Johannes Thumshirn
2018-10-04 10:04                           ` Johannes Thumshirn
2018-10-04 10:04                           ` Johannes Thumshirn
2018-10-04 10:04                           ` Johannes Thumshirn
2018-10-04 10:04                           ` Johannes Thumshirn
2018-10-02 15:07       ` Jan Kara
2018-10-02 15:07         ` Jan Kara
2018-10-02 15:07         ` Jan Kara
2018-10-17 20:23     ` Jeff Moyer
2018-10-17 20:23       ` Jeff Moyer
2018-10-17 20:23       ` Jeff Moyer
2018-10-17 20:23       ` Jeff Moyer
2018-10-18  0:25       ` Dave Chinner
2018-10-18  0:25         ` Dave Chinner
2018-10-18  0:25         ` Dave Chinner
2018-10-18 14:55         ` Jan Kara
2018-10-18 14:55           ` Jan Kara
2018-10-19  0:43           ` Dave Chinner
2018-10-19  0:43             ` Dave Chinner
2018-10-19  0:43             ` Dave Chinner
2018-10-30  6:30             ` Dan Williams
2018-10-30  6:30               ` Dan Williams
2018-10-30  6:30               ` Dan Williams
2018-10-30 22:49               ` Dave Chinner
2018-10-30 22:49                 ` Dave Chinner
2018-10-30 22:49                 ` Dave Chinner
2018-10-30 22:59                 ` Dan Williams
2018-10-30 22:59                   ` Dan Williams
2018-10-30 22:59                   ` Dan Williams
2018-10-31  5:59                 ` y-goto
2018-10-31  5:59                   ` y-goto-LMvhtfratI1BDgjK7y7TUQ
2018-10-31  5:59                   ` y-goto
2018-11-01 23:00                   ` Dave Chinner
2018-11-01 23:00                     ` Dave Chinner
2018-11-01 23:00                     ` Dave Chinner
2018-11-02  1:43                     ` y-goto
2018-11-02  1:43                       ` y-goto-LMvhtfratI1BDgjK7y7TUQ
2018-11-02  1:43                       ` y-goto
2018-10-18 21:05         ` Jeff Moyer
2018-10-18 21:05           ` Jeff Moyer
2018-10-18 21:05           ` Jeff Moyer
2018-10-18 21:05           ` Jeff Moyer
2018-10-09 19:43 ` Jeff Moyer
2018-10-09 19:43   ` Jeff Moyer
2018-10-09 19:43   ` Jeff Moyer
2018-10-16  8:25   ` Jan Kara
2018-10-16  8:25     ` Jan Kara
2018-10-16 12:35     ` Jeff Moyer
2018-10-16 12:35       ` Jeff Moyer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAPcyv4iJvN6_Cf6tw=5a=Uh99LfMFKU7n8QkGcz1ZaxL0Oi-3w@mail.gmail.com' \
    --to=dan.j.williams@intel.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nvdimm@lists.01.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.