From: Dan Williams <email@example.com> To: david <firstname.lastname@example.org> Cc: Jan Kara <email@example.com>, jmoyer <firstname.lastname@example.org>, Johannes Thumshirn <email@example.com>, Dave Jiang <firstname.lastname@example.org>, linux-nvdimm <email@example.com>, Linux MM <firstname.lastname@example.org>, linux-fsdevel <email@example.com>, linux-ext4 <firstname.lastname@example.org>, linux-xfs <email@example.com>, Linux API <firstname.lastname@example.org> Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps Date: Tue, 30 Oct 2018 15:59:09 -0700 Message-ID: <CAPcyv4gwBJ65170Gy+EaDLpomA93kxuWTM10mZeaPYV6Rq=SRw@mail.gmail.com> (raw) In-Reply-To: <20181030224904.GT19305@dastard> On Tue, Oct 30, 2018 at 3:49 PM Dave Chinner <email@example.com> wrote: > > On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote: > > On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <firstname.lastname@example.org> wrote: > > > On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote: > > > > On Thu 18-10-18 11:25:10, Dave Chinner wrote: > > > > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote: > > > > > > MAP_SYNC > > > > > > - file system guarantees that metadata required to reach faulted-in file > > > > > > data is consistent on media before a write fault is completed. A > > > > > > side-effect is that the page cache will not be used for > > > > > > writably-mapped pages. > > > > > > > > > > I think you are conflating current implementation with API > > > > > requirements - MAP_SYNC doesn't guarantee anything about page cache > > > > > use. The man page definition simply says "supported only for files > > > > > supporting DAX" and that it provides certain data integrity > > > > > guarantees. It does not define the implementation. > .... > > > > With O_DIRECT the fallback to buffered IO is quite rare (at least for major > > > > filesystems) so usually people just won't notice. If fallback for > > > > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful. > > > > > > Which is just like the situation where O_DIRECT on ext3 was not very > > > useful, but on other filesystems like XFS it was fully functional. > > > > > > IMO, the fact that a specific filesytem has a suboptimal fallback > > > path for an uncommon behaviour isn't an argument against MAP_DIRECT > > > as a hint - it's actually a feature. If MAP_DIRECT can't be used > > > until it's always direct access, then most filesystems wouldn't be > > > able to provide any faster paths at all. It's much better to have > > > partial functionality now than it is to never have the functionality > > > at all, and so we need to design in the flexibility we need to > > > iteratively improve implementations without needing API changes that > > > will break applications. > > > > The hard guarantee requirement still remains though because an > > application that expects combined MAP_SYNC|MAP_DIRECT semantics will > > be surprised if the MAP_DIRECT property silently disappears. > > Why would they be surprised? They won't even notice it if the > filesystem can provide MAP_SYNC without MAP_DIRECT. > > And that's the whole point. > > MAP_DIRECT is a private mapping state. So is MAP_SYNC. They are not > visible to the filesystem and the filesystem does nothing to enforce > them. If someone does something that requires the page cache (e.g. > calls do_splice_direct()) then that MAP_DIRECT mapping has a whole > heap of new work to do. And, in some cases, the filesystem may not > be able to provide MAP_DIRECT as a result.. > > IOWs, the filesystem cannot guarantee MAP_DIRECT and the > circumstances under which MAP_DIRECT will and will not work are > dynamic. If MAP_DIRECT is supposed to be a guarantee then we'll have > applications randomly segfaulting in production as things like > backups, indexers, etc run over the filesystem and do their work. > > This is why MAP_DIRECT needs to be an optimisation, not a > requirement - things will still work if MAP_DIRECT is not used. What > matters to these applications is MAP_SYNC - if we break MAP_SYNC, > then the application data integrity model is violated. That's not an > acceptible outcome. > > The problem, it seems to me, is that people are unable to separate > MAP_DIRECT and MAP_SYNC. I suspect that is because, at present, > MAP_SYNC on XFS and ext4 requires MAP_DIRECT. i.e. we can only > provide MAP_SYNC functionality on DAX mappings. However, that's a > /filesystem implementation issue/, not an API guarantee we need to > provide to userspace. > > If we implement a persistent page cache (e.g. allocate page cache > pages out of ZONE_DEVICE pmem), then filesystems like XFS and ext4 > could provide applications with the MAP_SYNC data integrity model > without MAP_DIRECT. Indeed, those filesystems would not even be able > to provide MAP_DIRECT semantics because they aren't backed by pmem. > > Hence if applications that want MAP_SYNC are hard coded > MAP_SYNC|MAP_DIRECT and we make MAP_DIRECT a hard guarantee, then > those applications are going to fail on a filesystem that provides > only MAP_SYNC. This is despite the fact the applications would > function correctly and the data integrity model would be maintained. > i.e. the failure is because applications have assumed MAP_SYNC can > only be provided by a DAX implementation, not because MAP_SYNC is > not supported. > > MAP_SYNC really isn't about DAX at all. It's about enabling a data > integrity model that requires the filesystem to provide userspace > access to CPU addressable persistent memory. DAX+MAP_DIRECT is just > one method of providing this functionality, but it's not the only > method. Our API needs to be future proof rather than an encoding of > the existing implementation limitations, otherwise apps will have to > be re-written as every new MAP_SYNC capable technology comes along. > > In summary: > > MAP_DIRECT is an access hint. > > MAP_SYNC provides a data integrity model guarantee. > > MAP_SYNC may imply MAP_DIRECT for specific implementations, > but it does not require or guarantee MAP_DIRECT. > > Let's compare that with O_DIRECT: > > O_DIRECT in an access hint. > > O_DSYNC provides a data integrity model guarantee. > > O_DSYNC may imply O_DIRECT for specific implementations, but > it does not require or guarantee O_DIRECT. > > Consistency in access and data integrity models is a good thing. DAX > and pmem is not an exception. We need to use a model we know works > and has proven itself over a long period of time. > > > I think > > it still makes some sense as a hint for apps that want to minimize > > page cache, but for the applications with a flush from userspace model > > I think that wants to be an F_SETLEASE / F_DIRECTLCK operation. This > > still gives the filesystem the option to inject page-cache at will, > > but with an application coordination point. > > Why make it more complex for applications than it needs to be? With the clarification that MAP_SYNC implies "cpu cache flush to persistent memory page-cache *or* dax to persistent memory" I think all of the concerns are addressed. I was conflating MAP_DIRECT as "no page cache indirection", but the indirection does not matter if the page cache itself is persisted.
next prev parent reply index Thread overview: 45+ messages / expand[flat|nested] mbox.gz Atom feed top 2018-10-02 10:05 Jan Kara 2018-10-02 10:50 ` Michal Hocko 2018-10-02 13:32 ` Jan Kara 2018-10-02 12:10 ` Johannes Thumshirn 2018-10-02 14:20 ` Johannes Thumshirn 2018-10-02 14:45 ` Christoph Hellwig 2018-10-02 15:01 ` Johannes Thumshirn 2018-10-02 15:06 ` Christoph Hellwig 2018-10-04 10:09 ` Johannes Thumshirn 2018-10-05 6:25 ` Christoph Hellwig 2018-10-05 6:35 ` Johannes Thumshirn 2018-10-06 1:17 ` Dan Williams 2018-10-14 15:47 ` Dan Williams 2018-10-17 20:01 ` Dan Williams 2018-10-18 17:43 ` Jan Kara 2018-10-18 19:10 ` Dan Williams 2018-10-19 3:01 ` Dave Chinner 2018-10-02 14:29 ` Jan Kara 2018-10-02 14:37 ` Christoph Hellwig 2018-10-02 14:44 ` Johannes Thumshirn 2018-10-02 14:52 ` Christoph Hellwig 2018-10-02 15:31 ` Jan Kara 2018-10-02 20:18 ` Dan Williams 2018-10-03 12:50 ` Jan Kara 2018-10-03 14:38 ` Dan Williams 2018-10-03 15:06 ` Jan Kara 2018-10-03 15:13 ` Dan Williams 2018-10-03 16:44 ` Jan Kara 2018-10-03 21:13 ` Dan Williams 2018-10-04 10:04 ` Johannes Thumshirn 2018-10-02 15:07 ` Jan Kara 2018-10-17 20:23 ` Jeff Moyer 2018-10-18 0:25 ` Dave Chinner 2018-10-18 14:55 ` Jan Kara 2018-10-19 0:43 ` Dave Chinner 2018-10-30 6:30 ` Dan Williams 2018-10-30 22:49 ` Dave Chinner 2018-10-30 22:59 ` Dan Williams [this message] 2018-10-31 5:59 ` y-goto 2018-11-01 23:00 ` Dave Chinner 2018-11-02 1:43 ` y-goto 2018-10-18 21:05 ` Jeff Moyer 2018-10-09 19:43 ` Jeff Moyer 2018-10-16 8:25 ` Jan Kara 2018-10-16 12:35 ` Jeff Moyer
Reply instructions: You may reply publically to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to='CAPcyv4gwBJ65170Gy+EaDLpomA93kxuWTM10mZeaPYV6Rq=SRw@mail.gmail.com' \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
Linux-Fsdevel Archive on lore.kernel.org Archives are clonable: git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \ email@example.com public-inbox-index linux-fsdevel Example config snippet for mirrors Newsgroup available over NNTP: nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel AGPL code for this site: git clone https://public-inbox.org/public-inbox.git