Message-ID: <555B8C79.4090909@phunq.net>
Date: Tue, 19 May 2015 12:18:17 -0700
From: Daniel Phillips
To: Jan Kara
CC: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, tux3@tux3.org, OGAWA Hirofumi, Rik van Riel
Subject: Re: [FYI] tux3: Core changes
References: <8f886f13-6550-4322-95be-93244ae61045@phunq.net> <55545C2F.8040207@phunq.net> <20150519140045.GA16313@quack.suse.cz>
In-Reply-To: <20150519140045.GA16313@quack.suse.cz>

Hi Jan,

On 05/19/2015 07:00 AM, Jan Kara wrote:
> On Thu 14-05-15 01:26:23, Daniel Phillips wrote:
>> Hi Rik,
>>
>> Our linux-tux3 tree currently carries this 652 line diff
>> against core, to make Tux3 work. This is mainly by Hirofumi, except
>> the fs-writeback.c hook, which is by me. The main part you may be
>> interested in is rmap.c, which addresses the issues raised at the
>> 2013 Linux Storage Filesystem and MM Summit in San Francisco.[1]
>>
>> [1] LSFMM: Page forking
>> http://lwn.net/Articles/548091/
>>
>> This is just a FYI. An upcoming Tux3 report will be a tour of the page
>> forking design and implementation. For now, this is just to give a
>> general sense of what we have done. We heard there are concerns about
>> how ptrace will work. I really am not familiar with the issue, could
>> you please explain what you were thinking of there?
>
> So here are a few things I find problematic about page forking (besides
> the cases with elevated page_count already discussed in this thread - there
> I believe that anything more complex than "wait for the IO instead of
> forking when page has elevated use count" isn't going to work. There are
> too many users depending on too subtle details of the behavior...). Some
> of them are actually mentioned in the above LWN article:
>
> When you create a copy of a page and replace it in the radix tree, nobody
> in mm subsystem is aware that oldpage may be under writeback. That causes
> interesting issues:
> * truncate_inode_pages() can finish before all IO for the file is finished.
>   So far filesystems rely on the fact that once truncate_inode_pages()
>   finishes all running IO against the file is completed and new cannot be
>   submitted.

We do not use truncate_inode_pages because of issues like that. We use
some truncate helpers, which were available in some cases, or else had
to be implemented in Tux3 to make everything work properly. The details
are Hirofumi's stomping grounds. I am pretty sure that his solution is
good and tight, or Tux3 would not pass its torture tests.

> * Writeback can come and try to write newpage while oldpage is still under
>   IO. Then you'll have two IOs against one block which has undefined
>   results.

Those writebacks only come from Tux3 (or indirectly from fs-writeback,
through our writeback) so we are able to ensure that a dirty block is
only written once. (If redirtied, the block will fork so two dirty
blocks are written in two successive deltas.)

> * filemap_fdatawait() called from fsync() has additional problem that it is
>   not aware of oldpage and thus may return although IO hasn't finished yet.

We do not use filemap_fdatawait; instead, we wait on completion of our
own writeback, which is under our control.

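To make the "written once, forks if redirtied" rule concrete, here is a
toy userspace model in Python. It is only a sketch of the idea under
simplifying assumptions, not Tux3 code: Page, ToyCache, write,
commit_delta and complete_io are all made-up names, and nothing here
models mmap, elevated page counts or locking.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """One cached copy of a file block (hypothetical toy type)."""
    data: bytes
    delta: int               # delta this copy belongs to
    under_io: bool = False   # True once submitted for writeback


@dataclass
class ToyCache:
    """Toy page cache keyed by block index; illustrative only."""
    current_delta: int = 0
    pages: dict = field(default_factory=dict)      # index -> newest Page
    in_flight: list = field(default_factory=list)  # (delta, index, Page) submitted
    disk_log: list = field(default_factory=list)   # (delta, index, data) completed

    def write(self, index, data):
        page = self.pages.get(index)
        if (page is not None and page.delta == self.current_delta
                and not page.under_io):
            # Still private to the current delta: modify in place.
            page.data = data
            return page
        # Fork: the old copy stays frozen for its own delta, and a fresh
        # copy in the current delta receives the new data.
        forked = Page(data=data, delta=self.current_delta)
        self.pages[index] = forked
        return forked

    def commit_delta(self):
        """Submit each dirty page of the current delta exactly once."""
        for index, page in self.pages.items():
            if page.delta == self.current_delta and not page.under_io:
                page.under_io = True
                self.in_flight.append((page.delta, index, page))
        self.current_delta += 1

    def complete_io(self):
        """Finish all in-flight writes in submission order."""
        for delta, index, page in self.in_flight:
            self.disk_log.append((delta, index, page.data))
        self.in_flight.clear()


if __name__ == "__main__":
    cache = ToyCache()
    cache.write(0, b"A")
    cache.write(0, b"B")   # same delta: overwrites in place
    cache.commit_delta()   # block 0 submitted for delta 0
    cache.write(0, b"C")   # redirtied while under IO: forks
    cache.commit_delta()   # forked copy submitted for delta 1
    cache.complete_io()
    print(cache.disk_log)  # [(0, 0, b'B'), (1, 0, b'C')]
```

In this model the copy under IO is never touched again, so each delta
issues at most one write per block, and a redirtied block shows up as
two writes in two successive deltas.
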
> I understand that Tux3 may avoid these issues due to some other mechanisms
> it internally has but if page forking should get into mm subsystem, the
> above must work.

It does work, and by example, it does not need a lot of code to make
it work, but the changes are not trivial. Tux3's delta writeback model
will not suit everyone, so you can't just lift our code and add it to
Ext4. Using it in Ext4 would require a per-inode writeback model, which
looks practical to me but far from a weekend project. Maybe something
to consider for Ext5.

It is the job of new designs like Tux3 to chase after that final drop
of performance, not our trusty Ext4 workhorse. Though stranger things
have happened - as I recall, Ext4 had O(n) directory operations at one
time. Fixing that was not easy, but we did it because we had to. Fixing
Ext4's write performance is not urgent by comparison, and the barrier
is high; you would want jbd3 for one thing.

I think the meta-question you are asking is, where is the second user
for this new CoW functionality? With a possible implication that if
there is no second user then Tux3 cannot be merged. Is that the
question?

Regards,

Daniel