linux-fsdevel.vger.kernel.org archive mirror
From: Matthew Wilcox <willy@infradead.org>
To: Miklos Szeredi <mszeredi@redhat.com>
Cc: Boaz Harrosh <boazh@netapp.com>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Ric Wheeler <rwheeler@redhat.com>,
	Steve French <smfrench@gmail.com>,
	Steven Whitehouse <swhiteho@redhat.com>,
	Jefff moyer <jmoyer@redhat.com>, Sage Weil <sweil@redhat.com>,
	Jan Kara <jack@suse.cz>, Amir Goldstein <amir73il@gmail.com>,
	Andy Rudof <andy.rudoff@intel.com>,
	Anna Schumaker <Anna.Schumaker@netapp.com>,
	Amit Golander <Amit.Golander@netapp.com>,
	Sagi Manole <sagim@netapp.com>,
	Shachar Sharon <Shachar.Sharon@netapp.com>
Subject: Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
Date: Wed, 14 Mar 2018 04:17:50 -0700
Message-ID: <20180314111750.GA29631@bombadil.infradead.org>
In-Reply-To: <CAOssrKfoZKcu1Ku3YOGsoTXmdJeJy71bvQaZ6k3+r6_kD0B2Fg@mail.gmail.com>

On Wed, Mar 14, 2018 at 09:20:57AM +0100, Miklos Szeredi wrote:
> On Tue, Mar 13, 2018 at 7:56 PM, Matthew Wilcox <willy@infradead.org> wrote:
> > On Tue, Mar 13, 2018 at 07:15:46PM +0200, Boaz Harrosh wrote:
> >> On a call to mmap an mmap provider (like an FS) can put
> >> this flag on vma->vm_flags.
> >>
> >> This tells the kernel that the vma will be used from a single
> >> core only, and therefore invalidation of its PTE(s) does not
> >> need system-wide CPU scheduling.
> >>
> >> The motivation for this flag is the ZUFS project, where we want
> >> to optimally map user-application buffers into a user-mode server,
> >> execute the operation, and efficiently unmap.
> >
> > I've been looking at something similar, and I prefer my approach,
> > although I'm not nearly as far along with my implementation as you are.
> >
> > My approach is also to add a vm_flags bit, tentatively called VM_NOTLB.
> > The page fault handler refuses to insert any TLB entries into the process
> > address space.  But follow_page_mask() will return the appropriate struct
> > page for it.  This should be enough for O_DIRECT accesses to work as
> > you'll get the appropriate scatterlists built.
> >
> > I suspect Boaz has already done a lot of thinking about this and doesn't
> > need the explanation, but here's how it looks for anyone following along
> > at home:
> >
> > Process A calls read().
> > Kernel allocates a page cache page for it and calls the filesystem through
> >   ->readpages (or ->readpage).
> > Filesystem calls the managing process to get the data for that page.
> > Managing process draws a pentagram and summons Beelzebub (or runs Perl;
> >   whichever you find more scary).
> > Managing process notifies the filesystem that the page is now full of data.
> > Filesystem marks the page as being Uptodate and unlocks it.
> > Process was waiting on the page lock, wakes up and copies the data from the
> >   page cache into userspace.  read() is complete.
> >
> > What we're concerned about here is what to do after the managing process
> > tells the kernel that the read is complete.  Clearly allowing the managing
> > process continued access to the page is Bad as the page may be freed by the
> > page cache and then reused for something else.  Doing a TLB shootdown is
> > expensive.  So Boaz's approach is to have the process promise that it won't
> > have any other thread look at it.  My approach is to never allow the page
> > to have load/store access from userspace; it can only be passed to other
> > system calls.
> 
> This all seems to revolve around the fact that userspace fs server
> process needs to copy something into userspace client's buffer, right?
> 
> Instead of playing with memory mappings, why not just tell the kernel
> *what* to copy?
> 
> While in theory not as generic, I don't see any real limitations (you
> don't actually need the current contents of the buffer in the read
> case and vice versa in the write case).
> 
> And we already have an interface for this: splice(2).  What am I
> missing?  What's the killer argument in favor of the above messing
> with TLB caches etc., instead of just letting the kernel do the dirty
> work?

Great question.  You're completely right that the question is how to tell
the kernel what to copy.  The problem is that splice() can only write to
the first page of a pipe.  So you need one pipe per outstanding request,
which can easily turn into thousands of file descriptors.  If we enhanced
splice() so it could write to any page in a pipe, then I think splice()
would be the perfect interface.
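
For anyone following along at home, here is a rough userspace sketch of
what "one pipe per outstanding request" looks like with today's splice(2).
It is not ZUFS or FUSE code; the helper name copy_via_pipe() is made up
for illustration, and it assumes both descriptors support splice (regular
files on most filesystems do).  Each call creates its own private pipe,
which is exactly the per-request descriptor cost described above.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: move up to len bytes from in_fd to out_fd without
 * copying through a userspace buffer.  A private pipe is created for this
 * one request, so N concurrent requests cost 2*N file descriptors.
 * Returns the number of bytes moved, or -1 on error.
 */
static ssize_t copy_via_pipe(int in_fd, int out_fd, size_t len)
{
	int p[2];
	ssize_t total = 0;

	if (pipe(p) < 0)
		return -1;

	while ((size_t)total < len) {
		/* Source -> pipe.  May move less than asked (pipe capacity). */
		ssize_t n = splice(in_fd, NULL, p[1], NULL,
				   len - (size_t)total, SPLICE_F_MOVE);
		if (n < 0) {
			total = -1;
			break;
		}
		if (n == 0)	/* end of input */
			break;

		/* Pipe -> destination.  Drain everything just queued. */
		ssize_t left = n;
		while (left > 0) {
			ssize_t m = splice(p[0], NULL, out_fd, NULL,
					   (size_t)left, SPLICE_F_MOVE);
			if (m <= 0) {
				total = -1;
				goto out;
			}
			left -= m;
		}
		total += n;
	}
out:
	close(p[0]);
	close(p[1]);
	return total;
}
```

A server handling thousands of requests in flight would need thousands
of such pipes alive at once, which is the scaling problem; a splice()
that could target an arbitrary page of a shared pipe would avoid it.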
