Subject: Re: [RFC 1/7] mm: Add new vma flag VM_LOCAL_CPU
From: Boaz Harrosh
To: Miklos Szeredi
CC: Matthew Wilcox, linux-fsdevel, Ric Wheeler, Steve French, Steven Whitehouse,
    Jeff Moyer, Sage Weil, Jan Kara, Amir Goldstein, Andy Rudoff, Anna Schumaker,
    Amit Golander, Sagi Manole, Shachar Sharon
Date: Thu, 15 Mar 2018 18:30:47 +0200
Message-ID: <9bfa8d53-5693-7953-9dcf-79a8cff0b97f@netapp.com>
References: <443fea57-f165-6bed-8c8a-0a32f72b9cd2@netapp.com>
 <20180313185658.GB21538@bombadil.infradead.org>
 <07cda3e5-c911-a49b-fceb-052f8ca57e66@netapp.com>

On 15/03/18 18:10, Miklos Szeredi wrote:
<>
>> This can never properly translate. Even a simple file on disk
>> is linear for the app (unaligned buffer) but is scattered over
>> multiple blocks on disk. Yes, perhaps networking can somewhat work
>> if you pre/post-pend the headers you need.
>> And you force direct-IO semantics on everything, especially the APP.
>> With my system you can do zero copy with any kind of application.
>
> I lost you there, sorry.
>
> How will your scheme deal with alignment issues better than my scheme?
>

In my pmem case it is an easy memcpy. I agree this will not work if you
need to go to a hard disk (which is not a priority for me).

>> And this assumes networking or some device, which means going back
>> to the Kernel; under ZUFS rules you must return -ASYNC to the zuf
>> and complete in a background ASYNC thread. This is an order of
>> magnitude higher latency than what I showed here.
>
> Indeed.
>
>> And what about the SYNC copy from Server to APP? With a pipe you
>> are forcing me to go back to the Kernel to execute the copy, which
>> means two more crossings. This will double the round trips.
>
> If you are trying to minimize the roundtrips, why not cache the
> mapping in the kernel? That way you don't necessarily have to go to
> userspace at all. With readahead logic, the server will be able to
> preload the mapping before the reads happen, and you basically get the
> same speed as an in-kernel fs would.
>

Yes, as I said, that was my first approach. But in the end this is always
a special-workload optimization; in the general case it actually adds a
round trip and a lot of complexity that always comes back to bite you.

> Also I don't quite understand how you are planning to generalize beyond
> the pmem case. The interface is ready for that, sure. But what about
> caching? Will that be done in the server? Does that make sense?
> The Kernel already has a page cache for that purpose, and a userspace
> cache won't ever be as good as the kernel cache.
>

I explained about that. We can easily support page-cache in zufs. Here is
what I wrote:

> Please note that it will be very easy with this API to also support
> page-cache for FSs that want it, like the network FSs you mentioned.
> The FS will set a bit in the fs_register call to say that it would
> rather use the page cache. These types of FSs will run on a different
> kind of BDI which says "Yes, page cache please". All the IO entry
> vectors point to the generic_iter API, and instead we implement
> read/write_pages(). At read/write_pages() we do the exact same
> OP_READ/WRITE as today: map the cache pages into the zus VM, dispatch,
> return, release the page lock. All is happy. Anyone wanting to
> contribute this is very welcome.

Yes please, no caching at the zus level; that's insane.
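For illustration only, a rough sketch of what such a page-cache read hook
could look like. The zus/zuf-side names here (zus_map_pages(),
zus_unmap_pages(), zufs_dispatch(), ZUFS_OP_READ_PAGES) are made up for the
example and are not the actual zufs API; only the VFS/page-cache calls are
real kernel interfaces.

/*
 * Sketch only -- not real zufs code.  The zus_*()/zufs_*() helpers and
 * ZUFS_OP_READ_PAGES are hypothetical stand-ins for "map the cache pages
 * into the zus (server) VM, dispatch the operation, release the page lock".
 */
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

static int zufs_cached_readpage(struct file *file, struct page *page)
{
	struct inode *inode = page->mapping->host;
	loff_t pos = page_offset(page);
	int err;

	/* Expose the page-cache page to the user-mode server (zus) */
	err = zus_map_pages(inode, &page, 1);		/* hypothetical */
	if (err)
		goto out;

	/* Same OP_READ style dispatch as the pmem/dax path uses today */
	err = zufs_dispatch(inode, ZUFS_OP_READ_PAGES,	/* hypothetical */
			    pos, PAGE_SIZE);

	zus_unmap_pages(inode, &page, 1);		/* hypothetical */
out:
	if (!err)
		SetPageUptodate(page);
	else
		SetPageError(page);

	/* The VFS locked the page before calling ->readpage() */
	unlock_page(page);
	return err;
}

static const struct address_space_operations zufs_cached_aops = {
	.readpage	= zufs_cached_readpage,
	/* .writepages would dispatch an OP_WRITE_PAGES the same way */
};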
> Thanks,
> Miklos
>

Thanks
Boaz