All of lore.kernel.org
 help / color / mirror / Atom feed
From: Maxim Patlasov <mpatlasov@parallels.com>
To: Miklos Szeredi <miklos@szeredi.hu>
Cc: <riel@redhat.com>, Kirill Korotaev <dev@parallels.com>,
	Pavel Emelianov <xemul@parallels.com>,
	fuse-devel <fuse-devel@lists.sourceforge.net>,
	Brian Foster <bfoster@redhat.com>,
	Kernel Mailing List <linux-kernel@vger.kernel.org>,
	James Bottomley <jbottomley@parallels.com>, <linux-mm@kvack.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Linux-Fsdevel <linux-fsdevel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	<fengguang.wu@intel.com>, <devel@openvz.org>,
	Mel Gorman <mgorman@suse.de>
Subject: Re: [PATCH 10/16] fuse: Implement writepages callback
Date: Tue, 3 Sep 2013 20:02:50 +0400	[thread overview]
Message-ID: <5226082A.1050104@parallels.com> (raw)
In-Reply-To: <20130903103132.GA7191@tucsk.piliscsaba.szeredi.hu>

09/03/2013 02:31 PM, Miklos Szeredi пишет:
> On Fri, Aug 30, 2013 at 06:50:18PM +0400, Maxim Patlasov wrote:
>> Hi Miklos,
>>
>> 08/30/2013 02:12 PM, Miklos Szeredi пишет:
>>> On Fri, Aug 09, 2013 at 07:02:12PM +0400, Maxim Patlasov wrote:
>>>> 08/06/2013 08:25 PM, Miklos Szeredi пишет:
>>>>> Hmm.  Direct IO on an mmaped file will do get_user_pages() which will
>>>>> do the necessary page fault magic and ->page_mkwrite() will be called.
>>>>> At least AFAICS.
>>>> Yes, I agree.
>>>>
>>>>> The page cannot become dirty through a memory mapping without first
>>>>> switching the pte from read-only to read-write first.  Page accounting
>>>>> logic relies on this too.  The other way the page can become dirty is
>>>>> through write(2) on the fs.  But we do get notified about that too.
>>>> Yes, that's correct, but I don't understand why you disregard two
>>>> other cases of marking page dirty (both related to direct AIO read
>>> >from a file to a memory region mmap-ed to a fuse file):
>>>> 1. dio_bio_submit() -->
>>>>        bio_set_pages_dirty() -->
>>>>          set_page_dirty_lock()
>>>>
>>>> 2. dio_bio_complete() -->
>>>>        bio_check_pages_dirty() -->
>>>>           bio_dirty_fn() -->
>>>>              bio_set_pages_dirty() -->
>>>>                 set_page_dirty_lock()
>>>>
>>>> As soon as a page became dirty through a memory mapping (exactly as
>>>> you explained), nothing would prevent it to be written-back. And
>>>> fuse will call end_page_writeback almost immediately after copying
>>>> the real page to a temporary one. Then dio_bio_submit may re-dirty
>>>> page speculatively w/o notifying fuse. And again, since then nothing
>>>> would prevent it to be written-back once more. Hence we can end up
>>>> in more then one temporary page in fuse write-back. And similar
>>>> concern for dio_bio_complete() re-dirty.
>>>>
>>>> This make me think that we do need fuse_page_is_writeback() in
>>>> fuse_writepages_fill(). But it shouldn't be harmful because it will
>>>> no-op practically always due to waiting for fuse writeback in
>>>> ->page_mkwrite() and in course of handling write(2).
>>> The problem is: if we need it in ->writepages, we need it in ->writepage too.
>>> And that's where we can't have it because it would deadlock in reclaim.
>> I thought we're protected from the deadlock by the following chunk
>> (in the very beginning of fuse_writepage):
>>
>>> +	if (fuse_page_is_writeback(inode, page->index)) {
>>> +		if (wbc->sync_mode != WB_SYNC_ALL) {
>>> +			redirty_page_for_writepage(wbc, page);
>>> +			return 0;
>>> +		}
>>> +		fuse_wait_on_page_writeback(inode, page->index);
>>> +	}
>> Because reclaimer will never call us with WB_SYNC_ALL. Did I miss
>> something?
> Yeah, we could have that in ->writepage() too.  And apparently that would work,
> reclaim would just leave us alone.
>
> Then there's sync(2) which does do WB_SYNC_ALL. Yet for an unprivileged fuse
> mount we don't want ->writepages() to block because that's a quite clear DoS
> issue.

Yes, I agree, but those cases (when sync(2) coincides with a page under 
fuse writeback originated by flusher coinciding with those direct AIO 
read redirty) should be very rare. I'd suggest to go on and put up with 
it for now: unprivileged users won't be able to use writeback_cache 
option until sysad enables allow_wbcache in fusermount.

>
> So we are left with this:

Yes. May we implement it as a separate fix after inclusion of this 
patch-set?

>
>>> There's a way to work around this:
>>>
>>>     - if the request is still in queue, just update it with the contents of
>>>       the new page
>>>
>>>     - if the request already in userspace, create a new reqest, but only let
>>>       userspace have it once the previous request for the same page
>>>       completes, so the ordering is not messed up
>>>
>>> But that's a lot of hairy code.
>> Is it exactly how NFS solves similar problem?
> NFS will apparently just block if there's a request outstanding and we are in
> WB_SYNC_ALL mode.  Which is somewhat simpler.

Yes, indeed.

Thanks,
Maxim

WARNING: multiple messages have this Message-ID (diff)
From: Maxim Patlasov <mpatlasov@parallels.com>
To: Miklos Szeredi <miklos@szeredi.hu>
Cc: <riel@redhat.com>, Kirill Korotaev <dev@parallels.com>,
	Pavel Emelianov <xemul@parallels.com>,
	fuse-devel <fuse-devel@lists.sourceforge.net>,
	Brian Foster <bfoster@redhat.com>,
	Kernel Mailing List <linux-kernel@vger.kernel.org>,
	James Bottomley <jbottomley@parallels.com>, <linux-mm@kvack.org>,
	Al Viro <viro@zeniv.linux.org.uk>,
	Linux-Fsdevel <linux-fsdevel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	<fengguang.wu@intel.com>, <devel@openvz.org>,
	Mel Gorman <mgorman@suse.de>
Subject: Re: [PATCH 10/16] fuse: Implement writepages callback
Date: Tue, 3 Sep 2013 20:02:50 +0400	[thread overview]
Message-ID: <5226082A.1050104@parallels.com> (raw)
In-Reply-To: <20130903103132.GA7191@tucsk.piliscsaba.szeredi.hu>

09/03/2013 02:31 PM, Miklos Szeredi пишет:
> On Fri, Aug 30, 2013 at 06:50:18PM +0400, Maxim Patlasov wrote:
>> Hi Miklos,
>>
>> 08/30/2013 02:12 PM, Miklos Szeredi пишет:
>>> On Fri, Aug 09, 2013 at 07:02:12PM +0400, Maxim Patlasov wrote:
>>>> 08/06/2013 08:25 PM, Miklos Szeredi пишет:
>>>>> Hmm.  Direct IO on an mmaped file will do get_user_pages() which will
>>>>> do the necessary page fault magic and ->page_mkwrite() will be called.
>>>>> At least AFAICS.
>>>> Yes, I agree.
>>>>
>>>>> The page cannot become dirty through a memory mapping without first
>>>>> switching the pte from read-only to read-write first.  Page accounting
>>>>> logic relies on this too.  The other way the page can become dirty is
>>>>> through write(2) on the fs.  But we do get notified about that too.
>>>> Yes, that's correct, but I don't understand why you disregard two
>>>> other cases of marking page dirty (both related to direct AIO read
>>> >from a file to a memory region mmap-ed to a fuse file):
>>>> 1. dio_bio_submit() -->
>>>>        bio_set_pages_dirty() -->
>>>>          set_page_dirty_lock()
>>>>
>>>> 2. dio_bio_complete() -->
>>>>        bio_check_pages_dirty() -->
>>>>           bio_dirty_fn() -->
>>>>              bio_set_pages_dirty() -->
>>>>                 set_page_dirty_lock()
>>>>
>>>> As soon as a page became dirty through a memory mapping (exactly as
>>>> you explained), nothing would prevent it to be written-back. And
>>>> fuse will call end_page_writeback almost immediately after copying
>>>> the real page to a temporary one. Then dio_bio_submit may re-dirty
>>>> page speculatively w/o notifying fuse. And again, since then nothing
>>>> would prevent it to be written-back once more. Hence we can end up
>>>> in more then one temporary page in fuse write-back. And similar
>>>> concern for dio_bio_complete() re-dirty.
>>>>
>>>> This make me think that we do need fuse_page_is_writeback() in
>>>> fuse_writepages_fill(). But it shouldn't be harmful because it will
>>>> no-op practically always due to waiting for fuse writeback in
>>>> ->page_mkwrite() and in course of handling write(2).
>>> The problem is: if we need it in ->writepages, we need it in ->writepage too.
>>> And that's where we can't have it because it would deadlock in reclaim.
>> I thought we're protected from the deadlock by the following chunk
>> (in the very beginning of fuse_writepage):
>>
>>> +	if (fuse_page_is_writeback(inode, page->index)) {
>>> +		if (wbc->sync_mode != WB_SYNC_ALL) {
>>> +			redirty_page_for_writepage(wbc, page);
>>> +			return 0;
>>> +		}
>>> +		fuse_wait_on_page_writeback(inode, page->index);
>>> +	}
>> Because reclaimer will never call us with WB_SYNC_ALL. Did I miss
>> something?
> Yeah, we could have that in ->writepage() too.  And apparently that would work,
> reclaim would just leave us alone.
>
> Then there's sync(2) which does do WB_SYNC_ALL. Yet for an unprivileged fuse
> mount we don't want ->writepages() to block because that's a quite clear DoS
> issue.

Yes, I agree, but those cases (when sync(2) coincides with a page under 
fuse writeback originated by flusher coinciding with those direct AIO 
read redirty) should be very rare. I'd suggest to go on and put up with 
it for now: unprivileged users won't be able to use writeback_cache 
option until sysad enables allow_wbcache in fusermount.

>
> So we are left with this:

Yes. May we implement it as a separate fix after inclusion of this 
patch-set?

>
>>> There's a way to work around this:
>>>
>>>     - if the request is still in queue, just update it with the contents of
>>>       the new page
>>>
>>>     - if the request already in userspace, create a new reqest, but only let
>>>       userspace have it once the previous request for the same page
>>>       completes, so the ordering is not messed up
>>>
>>> But that's a lot of hairy code.
>> Is it exactly how NFS solves similar problem?
> NFS will apparently just block if there's a request outstanding and we are in
> WB_SYNC_ALL mode.  Which is somewhat simpler.

Yes, indeed.

Thanks,
Maxim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)
From: Maxim Patlasov <mpatlasov@parallels.com>
To: Miklos Szeredi <miklos@szeredi.hu>
Cc: riel@redhat.com, Kirill Korotaev <dev@parallels.com>,
	Pavel Emelianov <xemul@parallels.com>,
	fuse-devel <fuse-devel@lists.sourceforge.net>,
	Brian Foster <bfoster@redhat.com>,
	Kernel Mailing List <linux-kernel@vger.kernel.org>,
	James Bottomley <jbottomley@parallels.com>,
	linux-mm@kvack.org, Al Viro <viro@zeniv.linux.org.uk>,
	Linux-Fsdevel <linux-fsdevel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	fengguang.wu@intel.com, devel@openvz.org,
	Mel Gorman <mgorman@suse.de>
Subject: Re: [PATCH 10/16] fuse: Implement writepages callback
Date: Tue, 3 Sep 2013 20:02:50 +0400	[thread overview]
Message-ID: <5226082A.1050104@parallels.com> (raw)
In-Reply-To: <20130903103132.GA7191@tucsk.piliscsaba.szeredi.hu>

09/03/2013 02:31 PM, Miklos Szeredi D?D,N?DuN?:
> On Fri, Aug 30, 2013 at 06:50:18PM +0400, Maxim Patlasov wrote:
>> Hi Miklos,
>>
>> 08/30/2013 02:12 PM, Miklos Szeredi D?D,N?DuN?:
>>> On Fri, Aug 09, 2013 at 07:02:12PM +0400, Maxim Patlasov wrote:
>>>> 08/06/2013 08:25 PM, Miklos Szeredi D?D,N?DuN?:
>>>>> Hmm.  Direct IO on an mmaped file will do get_user_pages() which will
>>>>> do the necessary page fault magic and ->page_mkwrite() will be called.
>>>>> At least AFAICS.
>>>> Yes, I agree.
>>>>
>>>>> The page cannot become dirty through a memory mapping without first
>>>>> switching the pte from read-only to read-write first.  Page accounting
>>>>> logic relies on this too.  The other way the page can become dirty is
>>>>> through write(2) on the fs.  But we do get notified about that too.
>>>> Yes, that's correct, but I don't understand why you disregard two
>>>> other cases of marking page dirty (both related to direct AIO read
>>> >from a file to a memory region mmap-ed to a fuse file):
>>>> 1. dio_bio_submit() -->
>>>>        bio_set_pages_dirty() -->
>>>>          set_page_dirty_lock()
>>>>
>>>> 2. dio_bio_complete() -->
>>>>        bio_check_pages_dirty() -->
>>>>           bio_dirty_fn() -->
>>>>              bio_set_pages_dirty() -->
>>>>                 set_page_dirty_lock()
>>>>
>>>> As soon as a page became dirty through a memory mapping (exactly as
>>>> you explained), nothing would prevent it to be written-back. And
>>>> fuse will call end_page_writeback almost immediately after copying
>>>> the real page to a temporary one. Then dio_bio_submit may re-dirty
>>>> page speculatively w/o notifying fuse. And again, since then nothing
>>>> would prevent it to be written-back once more. Hence we can end up
>>>> in more then one temporary page in fuse write-back. And similar
>>>> concern for dio_bio_complete() re-dirty.
>>>>
>>>> This make me think that we do need fuse_page_is_writeback() in
>>>> fuse_writepages_fill(). But it shouldn't be harmful because it will
>>>> no-op practically always due to waiting for fuse writeback in
>>>> ->page_mkwrite() and in course of handling write(2).
>>> The problem is: if we need it in ->writepages, we need it in ->writepage too.
>>> And that's where we can't have it because it would deadlock in reclaim.
>> I thought we're protected from the deadlock by the following chunk
>> (in the very beginning of fuse_writepage):
>>
>>> +	if (fuse_page_is_writeback(inode, page->index)) {
>>> +		if (wbc->sync_mode != WB_SYNC_ALL) {
>>> +			redirty_page_for_writepage(wbc, page);
>>> +			return 0;
>>> +		}
>>> +		fuse_wait_on_page_writeback(inode, page->index);
>>> +	}
>> Because reclaimer will never call us with WB_SYNC_ALL. Did I miss
>> something?
> Yeah, we could have that in ->writepage() too.  And apparently that would work,
> reclaim would just leave us alone.
>
> Then there's sync(2) which does do WB_SYNC_ALL. Yet for an unprivileged fuse
> mount we don't want ->writepages() to block because that's a quite clear DoS
> issue.

Yes, I agree, but those cases (when sync(2) coincides with a page under 
fuse writeback originated by flusher coinciding with those direct AIO 
read redirty) should be very rare. I'd suggest to go on and put up with 
it for now: unprivileged users won't be able to use writeback_cache 
option until sysad enables allow_wbcache in fusermount.

>
> So we are left with this:

Yes. May we implement it as a separate fix after inclusion of this 
patch-set?

>
>>> There's a way to work around this:
>>>
>>>     - if the request is still in queue, just update it with the contents of
>>>       the new page
>>>
>>>     - if the request already in userspace, create a new reqest, but only let
>>>       userspace have it once the previous request for the same page
>>>       completes, so the ordering is not messed up
>>>
>>> But that's a lot of hairy code.
>> Is it exactly how NFS solves similar problem?
> NFS will apparently just block if there's a request outstanding and we are in
> WB_SYNC_ALL mode.  Which is somewhat simpler.

Yes, indeed.

Thanks,
Maxim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2013-09-03 16:03 UTC|newest]

Thread overview: 79+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-06-29 17:41 [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy Maxim Patlasov
2013-06-29 17:41 ` Maxim Patlasov
2013-06-29 17:42 ` [PATCH 01/16] fuse: Linking file to inode helper Maxim Patlasov
2013-06-29 17:42   ` Maxim Patlasov
2013-06-29 17:42 ` [PATCH 02/16] fuse: Getting file for writeback helper Maxim Patlasov
2013-06-29 17:42   ` Maxim Patlasov
2013-06-29 17:42 ` [PATCH 03/16] fuse: Prepare to handle short reads Maxim Patlasov
2013-06-29 17:42   ` Maxim Patlasov
2013-06-29 17:42   ` Maxim Patlasov
2013-06-29 17:42 ` [PATCH 04/16] fuse: Prepare to handle multiple pages in writeback Maxim Patlasov
2013-06-29 17:42   ` Maxim Patlasov
2013-06-29 17:42 ` [PATCH 05/16] fuse: Connection bit for enabling writeback Maxim Patlasov
2013-06-29 17:42   ` Maxim Patlasov
2013-06-29 17:44 ` [PATCH 06/16] fuse: Trust kernel i_size only - v4 Maxim Patlasov
2013-06-29 17:44   ` Maxim Patlasov
2013-06-29 17:44 ` [PATCH 07/16] fuse: Trust kernel i_mtime only Maxim Patlasov
2013-06-29 17:44   ` Maxim Patlasov
2013-07-11 16:14   ` [PATCH 07/16] fuse: Trust kernel i_mtime only -v2 Maxim Patlasov
2013-06-29 17:45 ` [PATCH 08/16] fuse: Flush files on wb close Maxim Patlasov
2013-06-29 17:45   ` Maxim Patlasov
2013-07-11 16:18   ` [PATCH 08/16] fuse: Flush files on wb close -v2 Maxim Patlasov
2013-06-29 17:45 ` [PATCH 09/16] fuse: restructure fuse_readpage() Maxim Patlasov
2013-06-29 17:45   ` Maxim Patlasov
2013-06-29 17:45   ` Maxim Patlasov
2013-06-29 17:45 ` [PATCH 10/16] fuse: Implement writepages callback Maxim Patlasov
2013-06-29 17:45   ` Maxim Patlasov
2013-07-19 16:50   ` Miklos Szeredi
2013-07-19 16:50     ` Miklos Szeredi
2013-08-02 15:40     ` Maxim Patlasov
2013-08-02 15:40       ` Maxim Patlasov
2013-08-02 15:40       ` Maxim Patlasov
2013-08-06 16:25       ` Miklos Szeredi
2013-08-06 16:25         ` Miklos Szeredi
2013-08-06 16:26         ` Eric Boxer
2013-08-09 15:02         ` Maxim Patlasov
2013-08-09 15:02           ` Maxim Patlasov
2013-08-09 15:02           ` Maxim Patlasov
2013-08-30 10:12           ` Miklos Szeredi
2013-08-30 10:12             ` Miklos Szeredi
2013-08-30 10:12             ` Miklos Szeredi
2013-08-30 14:50             ` Maxim Patlasov
2013-08-30 14:50               ` Maxim Patlasov
2013-08-30 14:50               ` Maxim Patlasov
2013-09-03 10:31               ` Miklos Szeredi
2013-09-03 10:31                 ` Miklos Szeredi
2013-09-03 10:31                 ` Miklos Szeredi
2013-09-03 16:02                 ` Maxim Patlasov [this message]
2013-09-03 16:02                   ` Maxim Patlasov
2013-09-03 16:02                   ` Maxim Patlasov
2013-06-29 17:45 ` [PATCH 11/16] fuse: Implement write_begin/write_end callbacks Maxim Patlasov
2013-06-29 17:45   ` Maxim Patlasov
2013-06-29 17:46 ` [PATCH 12/16] fuse: fuse_writepage_locked() should wait on writeback Maxim Patlasov
2013-06-29 17:46   ` Maxim Patlasov
2013-06-29 17:46 ` [PATCH 13/16] fuse: fuse_flush() " Maxim Patlasov
2013-06-29 17:46   ` Maxim Patlasov
2013-06-29 17:46 ` [PATCH 14/16] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2 Maxim Patlasov
2013-06-29 17:46   ` Maxim Patlasov
2013-06-29 17:47 ` [PATCH 15/16] fuse: Turn writeback cache on Maxim Patlasov
2013-06-29 17:47   ` Maxim Patlasov
2013-06-29 17:48 ` [PATCH 16/16] mm: strictlimit feature Maxim Patlasov
2013-06-29 17:48   ` Maxim Patlasov
2013-07-01 21:16   ` Andrew Morton
2013-07-01 21:16     ` Andrew Morton
2013-07-02  8:33     ` Maxim Patlasov
2013-07-02  8:33       ` Maxim Patlasov
2013-07-02  8:33       ` Maxim Patlasov
2013-07-02 17:44   ` [PATCH] mm: strictlimit feature -v2 Maxim Patlasov
2013-07-02 17:44     ` Maxim Patlasov
2013-07-02 19:38     ` Andrew Morton
2013-07-02 19:38       ` Andrew Morton
2013-07-03 11:01       ` Maxim Patlasov
2013-07-03 11:01         ` Maxim Patlasov
2013-07-03 11:01         ` Maxim Patlasov
2013-07-03 23:16         ` Jan Kara
2013-07-03 23:16           ` Jan Kara
2013-07-03 23:16           ` Jan Kara
2013-07-05 13:14           ` Maxim Patlasov
2013-07-05 13:14             ` Maxim Patlasov
2013-07-05 13:14             ` Maxim Patlasov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5226082A.1050104@parallels.com \
    --to=mpatlasov@parallels.com \
    --cc=akpm@linux-foundation.org \
    --cc=bfoster@redhat.com \
    --cc=dev@parallels.com \
    --cc=devel@openvz.org \
    --cc=fengguang.wu@intel.com \
    --cc=fuse-devel@lists.sourceforge.net \
    --cc=jbottomley@parallels.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=miklos@szeredi.hu \
    --cc=riel@redhat.com \
    --cc=viro@zeniv.linux.org.uk \
    --cc=xemul@parallels.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.