Linux-Fsdevel Archive on
 help / color / Atom feed
From: Dave Chinner <>
To: "" <>
Cc: Dan Williams <>, Jan Kara <>,
	jmoyer <>,
	Johannes Thumshirn <>,
	Dave Jiang <>,
	linux-nvdimm <>,
	Linux MM <>,
	linux-fsdevel <>,
	linux-ext4 <>,
	linux-xfs <>,
	Linux API <>
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps
Date: Fri, 2 Nov 2018 10:00:12 +1100
Message-ID: <20181101230012.GC19305@dastard> (raw)
In-Reply-To: <>

On Wed, Oct 31, 2018 at 05:59:17AM +0000, wrote:
> > On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote:
> > > On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <> wrote:
> > In summary:
> > 
> > 	MAP_DIRECT is an access hint.
> > 
> > 	MAP_SYNC provides a data integrity model guarantee.
> > 
> > 	MAP_SYNC may imply MAP_DIRECT for specific implementations,
> > 	but it does not require or guarantee MAP_DIRECT.
> > 
> > Let's compare that with O_DIRECT:
> > 
> > 	O_DIRECT in an access hint.
> > 
> > 	O_DSYNC provides a data integrity model guarantee.
> > 
> > 	O_DSYNC may imply O_DIRECT for specific implementations, but
> > 	it does not require or guarantee O_DIRECT.
> > 
> > Consistency in access and data integrity models is a good thing. DAX
> > and pmem is not an exception. We need to use a model we know works
> > and has proven itself over a long period of time.
> Hmmm, then, I would like to know all of the reasons of breakage of MAP_DIRECT.
> (I'm not opposed to your opinion, but I need to know it.)
> In O_DIRECT case, in my understanding, the reason of breakage of O_DIRECT is 
> "wrong alignment is specified by application", right?

O_DIRECT has defined memory and offset alignment restrictions, and
will return an error to userspace when they are violated. It does
not fall back to buffered IO in this case. MAP_DIRECT has no
equivalent restriction, so IO alignment of O_DIRECT is largely
irrelevant here.

What we are talking about here is that some filesystems can only do
certain operations through buffered IO, such as block allocation or
file extension, and so silently fall back to doing them via buffered
IO even when O_DIRECT is specified. The old direct IO code used to
be full of conditionals to allow this - I think DIO_SKIP_HOLES is
only one remaining:

                 * For writes that could fill holes inside i_size on a
                 * DIO_SKIP_HOLES filesystem we forbid block creations: only
                 * overwrites are permitted. We will return early to the caller
                 * once we see an unmapped buffer head returned, and the caller
                 * will fall back to buffered I/O.
                 * Otherwise the decision is left to the get_blocks method,
                 * which may decide to handle it or also return an unmapped
                 * buffer head.
                create = dio->op == REQ_OP_WRITE;
                if (dio->flags & DIO_SKIP_HOLES) {
                        if (fs_startblk <= ((i_size_read(dio->inode) - 1) >>
                                create = 0;

Other cases like file extension cases are caught by the filesystems
before calling into the DIO code itself, so there's multiple avenues
for O_DIRECT transparently falling back to buffered IO.

This means the applications don't fail just because the filesystem
can't do a specific operation via O_DIRECT. The data writes still
succeed because they fall back to buffered IO, and the application
is blissfully unaware that the filesystem behaved that way.

> When filesystem can not use O_DIRECT and it uses page cache instead,
> then system uses more memory resource than user's expectation.

That's far better than failing unexpectedly because the app
unexpectedly came across a hole in the file (e.g. someone ran
sparsify across the filesystem).

> So, there is a side effect, and it may cause other trouble.
> (memory pressure, expected performance can not be gained, and so on ..)

Which is why people are supposed to test their systems before they
put them into production.

I've lost count of the number of times I've heard "but O_DIRECT is
supposed to make things faster!" because people don't understand
exactly what it does or means. Bypassing the page cache does not
magically make applications go faster - it puts the responsibility
for doing optimal IO on the application, not the kernel.

MAP_DIRECT will be no different. It's no guarantee that it will make
things faster, or that everything will just work as users expect
them to. It specifically places the responsibility for performing IO
in an optimal fashion on the application and the user for making
sure that it is fit for their purposes. Like O_DIRECT, using
MAP_DIRECT means "I, the application, know exactly what I'm doing,
so get out of the way as much as possible because I'm taking
responsibility for issuing IO in the most optimal manner now".

> In such case its administrator (or technical support engineer) needs to struggle to
> investigate what is the reason.

That's no different to performance problems that arise from
inappropriate use of O_DIRECT. It requires a certain level of
expertise to be able to understand and diagnose such issues.

> So, I would like to know in MAP_DIRECT case, what is the reasons? 
> I think it will be helpful for users.
> Only splice?

The filesystem can ignore MAP_DIRECT for any reason it needs to. I'm
certain that filesystem developers will try to maintain MAP_DIRECT
semantics as much as possible, but it's not going to be possible in
/all situations/ on XFS and ext4 because they simply haven't been
designed with DAX in mind. Filesystems designed specifically for
pmem and DAX might be able to provide MAP_DIRECT in all situations,
but those filesystems don't really exist yet.

This is no different to the early days of O_DIRECT. e.g.  ext3
couldn't do O_DIRECT for all operations when it was first
introduced, but over time the functionality improved as the
underlying issues were solved. If O_DIRECT was a guarantee, then
ext3 would have never supported O_DIRECT at all...

> (Maybe such document will be necessary....)

The semantics will need to be documented in the relevant man pages.


Dave Chinner

  reply index

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-02 10:05 Jan Kara
2018-10-02 10:50 ` Michal Hocko
2018-10-02 13:32   ` Jan Kara
2018-10-02 12:10 ` Johannes Thumshirn
2018-10-02 14:20   ` Johannes Thumshirn
2018-10-02 14:45     ` Christoph Hellwig
2018-10-02 15:01       ` Johannes Thumshirn
2018-10-02 15:06         ` Christoph Hellwig
2018-10-04 10:09           ` Johannes Thumshirn
2018-10-05  6:25             ` Christoph Hellwig
2018-10-05  6:35               ` Johannes Thumshirn
2018-10-06  1:17                 ` Dan Williams
2018-10-14 15:47                   ` Dan Williams
2018-10-17 20:01                     ` Dan Williams
2018-10-18 17:43                       ` Jan Kara
2018-10-18 19:10                         ` Dan Williams
2018-10-19  3:01                           ` Dave Chinner
2018-10-02 14:29   ` Jan Kara
2018-10-02 14:37     ` Christoph Hellwig
2018-10-02 14:44       ` Johannes Thumshirn
2018-10-02 14:52         ` Christoph Hellwig
2018-10-02 15:31           ` Jan Kara
2018-10-02 20:18             ` Dan Williams
2018-10-03 12:50               ` Jan Kara
2018-10-03 14:38                 ` Dan Williams
2018-10-03 15:06                   ` Jan Kara
2018-10-03 15:13                     ` Dan Williams
2018-10-03 16:44                       ` Jan Kara
2018-10-03 21:13                         ` Dan Williams
2018-10-04 10:04                         ` Johannes Thumshirn
2018-10-02 15:07       ` Jan Kara
2018-10-17 20:23     ` Jeff Moyer
2018-10-18  0:25       ` Dave Chinner
2018-10-18 14:55         ` Jan Kara
2018-10-19  0:43           ` Dave Chinner
2018-10-30  6:30             ` Dan Williams
2018-10-30 22:49               ` Dave Chinner
2018-10-30 22:59                 ` Dan Williams
2018-10-31  5:59                 ` y-goto
2018-11-01 23:00                   ` Dave Chinner [this message]
2018-11-02  1:43                     ` y-goto
2018-10-18 21:05         ` Jeff Moyer
2018-10-09 19:43 ` Jeff Moyer
2018-10-16  8:25   ` Jan Kara
2018-10-16 12:35     ` Jeff Moyer

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181101230012.GC19305@dastard \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-Fsdevel Archive on

Archives are clonable:
	git clone --mirror linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ \
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone public-inbox