From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757119Ab3ICQDD (ORCPT ); Tue, 3 Sep 2013 12:03:03 -0400 Received: from relay.parallels.com ([195.214.232.42]:58475 "EHLO relay.parallels.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753680Ab3ICQDB (ORCPT ); Tue, 3 Sep 2013 12:03:01 -0400 Message-ID: <5226082A.1050104@parallels.com> Date: Tue, 3 Sep 2013 20:02:50 +0400 From: Maxim Patlasov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130801 Thunderbird/17.0.8 MIME-Version: 1.0 To: Miklos Szeredi CC: , Kirill Korotaev , Pavel Emelianov , fuse-devel , Brian Foster , Kernel Mailing List , James Bottomley , , Al Viro , Linux-Fsdevel , Andrew Morton , , , Mel Gorman Subject: Re: [PATCH 10/16] fuse: Implement writepages callback References: <20130629172211.20175.70154.stgit@maximpc.sw.ru> <20130629174525.20175.18987.stgit@maximpc.sw.ru> <20130719165037.GA18358@tucsk.piliscsaba.szeredi.hu> <51FBD2DF.50506@parallels.com> <52050474.8040608@parallels.com> <20130830101207.GD19636@tucsk.piliscsaba.szeredi.hu> <5220B12A.1060008@parallels.com> <20130903103132.GA7191@tucsk.piliscsaba.szeredi.hu> In-Reply-To: <20130903103132.GA7191@tucsk.piliscsaba.szeredi.hu> Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: 8bit X-Originating-IP: [10.30.17.2] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org 09/03/2013 02:31 PM, Miklos Szeredi пишет: > On Fri, Aug 30, 2013 at 06:50:18PM +0400, Maxim Patlasov wrote: >> Hi Miklos, >> >> 08/30/2013 02:12 PM, Miklos Szeredi пишет: >>> On Fri, Aug 09, 2013 at 07:02:12PM +0400, Maxim Patlasov wrote: >>>> 08/06/2013 08:25 PM, Miklos Szeredi пишет: >>>>> Hmm. Direct IO on an mmaped file will do get_user_pages() which will >>>>> do the necessary page fault magic and ->page_mkwrite() will be called. >>>>> At least AFAICS. >>>> Yes, I agree. >>>> >>>>> The page cannot become dirty through a memory mapping without first >>>>> switching the pte from read-only to read-write first. Page accounting >>>>> logic relies on this too. The other way the page can become dirty is >>>>> through write(2) on the fs. But we do get notified about that too. >>>> Yes, that's correct, but I don't understand why you disregard two >>>> other cases of marking page dirty (both related to direct AIO read >>> >from a file to a memory region mmap-ed to a fuse file): >>>> 1. dio_bio_submit() --> >>>> bio_set_pages_dirty() --> >>>> set_page_dirty_lock() >>>> >>>> 2. dio_bio_complete() --> >>>> bio_check_pages_dirty() --> >>>> bio_dirty_fn() --> >>>> bio_set_pages_dirty() --> >>>> set_page_dirty_lock() >>>> >>>> As soon as a page became dirty through a memory mapping (exactly as >>>> you explained), nothing would prevent it to be written-back. And >>>> fuse will call end_page_writeback almost immediately after copying >>>> the real page to a temporary one. Then dio_bio_submit may re-dirty >>>> page speculatively w/o notifying fuse. And again, since then nothing >>>> would prevent it to be written-back once more. Hence we can end up >>>> in more then one temporary page in fuse write-back. And similar >>>> concern for dio_bio_complete() re-dirty. >>>> >>>> This make me think that we do need fuse_page_is_writeback() in >>>> fuse_writepages_fill(). But it shouldn't be harmful because it will >>>> no-op practically always due to waiting for fuse writeback in >>>> ->page_mkwrite() and in course of handling write(2). >>> The problem is: if we need it in ->writepages, we need it in ->writepage too. >>> And that's where we can't have it because it would deadlock in reclaim. >> I thought we're protected from the deadlock by the following chunk >> (in the very beginning of fuse_writepage): >> >>> + if (fuse_page_is_writeback(inode, page->index)) { >>> + if (wbc->sync_mode != WB_SYNC_ALL) { >>> + redirty_page_for_writepage(wbc, page); >>> + return 0; >>> + } >>> + fuse_wait_on_page_writeback(inode, page->index); >>> + } >> Because reclaimer will never call us with WB_SYNC_ALL. Did I miss >> something? > Yeah, we could have that in ->writepage() too. And apparently that would work, > reclaim would just leave us alone. > > Then there's sync(2) which does do WB_SYNC_ALL. Yet for an unprivileged fuse > mount we don't want ->writepages() to block because that's a quite clear DoS > issue. Yes, I agree, but those cases (when sync(2) coincides with a page under fuse writeback originated by flusher coinciding with those direct AIO read redirty) should be very rare. I'd suggest to go on and put up with it for now: unprivileged users won't be able to use writeback_cache option until sysad enables allow_wbcache in fusermount. > > So we are left with this: Yes. May we implement it as a separate fix after inclusion of this patch-set? > >>> There's a way to work around this: >>> >>> - if the request is still in queue, just update it with the contents of >>> the new page >>> >>> - if the request already in userspace, create a new reqest, but only let >>> userspace have it once the previous request for the same page >>> completes, so the ordering is not messed up >>> >>> But that's a lot of hairy code. >> Is it exactly how NFS solves similar problem? > NFS will apparently just block if there's a request outstanding and we are in > WB_SYNC_ALL mode. Which is somewhat simpler. Yes, indeed. Thanks, Maxim From mboxrd@z Thu Jan 1 00:00:00 1970 From: Maxim Patlasov Subject: Re: [PATCH 10/16] fuse: Implement writepages callback Date: Tue, 3 Sep 2013 20:02:50 +0400 Message-ID: <5226082A.1050104@parallels.com> References: <20130629172211.20175.70154.stgit@maximpc.sw.ru> <20130629174525.20175.18987.stgit@maximpc.sw.ru> <20130719165037.GA18358@tucsk.piliscsaba.szeredi.hu> <51FBD2DF.50506@parallels.com> <52050474.8040608@parallels.com> <20130830101207.GD19636@tucsk.piliscsaba.szeredi.hu> <5220B12A.1060008@parallels.com> <20130903103132.GA7191@tucsk.piliscsaba.szeredi.hu> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: quoted-printable Cc: , Kirill Korotaev , Pavel Emelianov , fuse-devel , Brian Foster , Kernel Mailing List , James Bottomley , , Al Viro , Linux-Fsdevel , Andrew Morton , , , Mel Gorman To: Miklos Szeredi Return-path: In-Reply-To: <20130903103132.GA7191@tucsk.piliscsaba.szeredi.hu> Sender: owner-linux-mm@kvack.org List-Id: linux-fsdevel.vger.kernel.org 09/03/2013 02:31 PM, Miklos Szeredi =D0=BF=D0=B8=D1=88=D0=B5=D1=82: > On Fri, Aug 30, 2013 at 06:50:18PM +0400, Maxim Patlasov wrote: >> Hi Miklos, >> >> 08/30/2013 02:12 PM, Miklos Szeredi =D0=BF=D0=B8=D1=88=D0=B5=D1=82: >>> On Fri, Aug 09, 2013 at 07:02:12PM +0400, Maxim Patlasov wrote: >>>> 08/06/2013 08:25 PM, Miklos Szeredi =D0=BF=D0=B8=D1=88=D0=B5=D1=82: >>>>> Hmm. Direct IO on an mmaped file will do get_user_pages() which wi= ll >>>>> do the necessary page fault magic and ->page_mkwrite() will be call= ed. >>>>> At least AFAICS. >>>> Yes, I agree. >>>> >>>>> The page cannot become dirty through a memory mapping without first >>>>> switching the pte from read-only to read-write first. Page account= ing >>>>> logic relies on this too. The other way the page can become dirty = is >>>>> through write(2) on the fs. But we do get notified about that too. >>>> Yes, that's correct, but I don't understand why you disregard two >>>> other cases of marking page dirty (both related to direct AIO read >>> >from a file to a memory region mmap-ed to a fuse file): >>>> 1. dio_bio_submit() --> >>>> bio_set_pages_dirty() --> >>>> set_page_dirty_lock() >>>> >>>> 2. dio_bio_complete() --> >>>> bio_check_pages_dirty() --> >>>> bio_dirty_fn() --> >>>> bio_set_pages_dirty() --> >>>> set_page_dirty_lock() >>>> >>>> As soon as a page became dirty through a memory mapping (exactly as >>>> you explained), nothing would prevent it to be written-back. And >>>> fuse will call end_page_writeback almost immediately after copying >>>> the real page to a temporary one. Then dio_bio_submit may re-dirty >>>> page speculatively w/o notifying fuse. And again, since then nothing >>>> would prevent it to be written-back once more. Hence we can end up >>>> in more then one temporary page in fuse write-back. And similar >>>> concern for dio_bio_complete() re-dirty. >>>> >>>> This make me think that we do need fuse_page_is_writeback() in >>>> fuse_writepages_fill(). But it shouldn't be harmful because it will >>>> no-op practically always due to waiting for fuse writeback in >>>> ->page_mkwrite() and in course of handling write(2). >>> The problem is: if we need it in ->writepages, we need it in ->writep= age too. >>> And that's where we can't have it because it would deadlock in reclai= m. >> I thought we're protected from the deadlock by the following chunk >> (in the very beginning of fuse_writepage): >> >>> + if (fuse_page_is_writeback(inode, page->index)) { >>> + if (wbc->sync_mode !=3D WB_SYNC_ALL) { >>> + redirty_page_for_writepage(wbc, page); >>> + return 0; >>> + } >>> + fuse_wait_on_page_writeback(inode, page->index); >>> + } >> Because reclaimer will never call us with WB_SYNC_ALL. Did I miss >> something? > Yeah, we could have that in ->writepage() too. And apparently that wou= ld work, > reclaim would just leave us alone. > > Then there's sync(2) which does do WB_SYNC_ALL. Yet for an unprivileged= fuse > mount we don't want ->writepages() to block because that's a quite clea= r DoS > issue. Yes, I agree, but those cases (when sync(2) coincides with a page under=20 fuse writeback originated by flusher coinciding with those direct AIO=20 read redirty) should be very rare. I'd suggest to go on and put up with=20 it for now: unprivileged users won't be able to use writeback_cache=20 option until sysad enables allow_wbcache in fusermount. > > So we are left with this: Yes. May we implement it as a separate fix after inclusion of this=20 patch-set? > >>> There's a way to work around this: >>> >>> - if the request is still in queue, just update it with the conte= nts of >>> the new page >>> >>> - if the request already in userspace, create a new reqest, but o= nly let >>> userspace have it once the previous request for the same page >>> completes, so the ordering is not messed up >>> >>> But that's a lot of hairy code. >> Is it exactly how NFS solves similar problem? > NFS will apparently just block if there's a request outstanding and we = are in > WB_SYNC_ALL mode. Which is somewhat simpler. Yes, indeed. Thanks, Maxim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx131.postini.com [74.125.245.131]) by kanga.kvack.org (Postfix) with SMTP id CE35C6B0034 for ; Tue, 3 Sep 2013 12:02:53 -0400 (EDT) Message-ID: <5226082A.1050104@parallels.com> Date: Tue, 3 Sep 2013 20:02:50 +0400 From: Maxim Patlasov MIME-Version: 1.0 Subject: Re: [PATCH 10/16] fuse: Implement writepages callback References: <20130629172211.20175.70154.stgit@maximpc.sw.ru> <20130629174525.20175.18987.stgit@maximpc.sw.ru> <20130719165037.GA18358@tucsk.piliscsaba.szeredi.hu> <51FBD2DF.50506@parallels.com> <52050474.8040608@parallels.com> <20130830101207.GD19636@tucsk.piliscsaba.szeredi.hu> <5220B12A.1060008@parallels.com> <20130903103132.GA7191@tucsk.piliscsaba.szeredi.hu> In-Reply-To: <20130903103132.GA7191@tucsk.piliscsaba.szeredi.hu> Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org List-ID: To: Miklos Szeredi Cc: riel@redhat.com, Kirill Korotaev , Pavel Emelianov , fuse-devel , Brian Foster , Kernel Mailing List , James Bottomley , linux-mm@kvack.org, Al Viro , Linux-Fsdevel , Andrew Morton , fengguang.wu@intel.com, devel@openvz.org, Mel Gorman 09/03/2013 02:31 PM, Miklos Szeredi D?D,N?DuN?: > On Fri, Aug 30, 2013 at 06:50:18PM +0400, Maxim Patlasov wrote: >> Hi Miklos, >> >> 08/30/2013 02:12 PM, Miklos Szeredi D?D,N?DuN?: >>> On Fri, Aug 09, 2013 at 07:02:12PM +0400, Maxim Patlasov wrote: >>>> 08/06/2013 08:25 PM, Miklos Szeredi D?D,N?DuN?: >>>>> Hmm. Direct IO on an mmaped file will do get_user_pages() which will >>>>> do the necessary page fault magic and ->page_mkwrite() will be called. >>>>> At least AFAICS. >>>> Yes, I agree. >>>> >>>>> The page cannot become dirty through a memory mapping without first >>>>> switching the pte from read-only to read-write first. Page accounting >>>>> logic relies on this too. The other way the page can become dirty is >>>>> through write(2) on the fs. But we do get notified about that too. >>>> Yes, that's correct, but I don't understand why you disregard two >>>> other cases of marking page dirty (both related to direct AIO read >>> >from a file to a memory region mmap-ed to a fuse file): >>>> 1. dio_bio_submit() --> >>>> bio_set_pages_dirty() --> >>>> set_page_dirty_lock() >>>> >>>> 2. dio_bio_complete() --> >>>> bio_check_pages_dirty() --> >>>> bio_dirty_fn() --> >>>> bio_set_pages_dirty() --> >>>> set_page_dirty_lock() >>>> >>>> As soon as a page became dirty through a memory mapping (exactly as >>>> you explained), nothing would prevent it to be written-back. And >>>> fuse will call end_page_writeback almost immediately after copying >>>> the real page to a temporary one. Then dio_bio_submit may re-dirty >>>> page speculatively w/o notifying fuse. And again, since then nothing >>>> would prevent it to be written-back once more. Hence we can end up >>>> in more then one temporary page in fuse write-back. And similar >>>> concern for dio_bio_complete() re-dirty. >>>> >>>> This make me think that we do need fuse_page_is_writeback() in >>>> fuse_writepages_fill(). But it shouldn't be harmful because it will >>>> no-op practically always due to waiting for fuse writeback in >>>> ->page_mkwrite() and in course of handling write(2). >>> The problem is: if we need it in ->writepages, we need it in ->writepage too. >>> And that's where we can't have it because it would deadlock in reclaim. >> I thought we're protected from the deadlock by the following chunk >> (in the very beginning of fuse_writepage): >> >>> + if (fuse_page_is_writeback(inode, page->index)) { >>> + if (wbc->sync_mode != WB_SYNC_ALL) { >>> + redirty_page_for_writepage(wbc, page); >>> + return 0; >>> + } >>> + fuse_wait_on_page_writeback(inode, page->index); >>> + } >> Because reclaimer will never call us with WB_SYNC_ALL. Did I miss >> something? > Yeah, we could have that in ->writepage() too. And apparently that would work, > reclaim would just leave us alone. > > Then there's sync(2) which does do WB_SYNC_ALL. Yet for an unprivileged fuse > mount we don't want ->writepages() to block because that's a quite clear DoS > issue. Yes, I agree, but those cases (when sync(2) coincides with a page under fuse writeback originated by flusher coinciding with those direct AIO read redirty) should be very rare. I'd suggest to go on and put up with it for now: unprivileged users won't be able to use writeback_cache option until sysad enables allow_wbcache in fusermount. > > So we are left with this: Yes. May we implement it as a separate fix after inclusion of this patch-set? > >>> There's a way to work around this: >>> >>> - if the request is still in queue, just update it with the contents of >>> the new page >>> >>> - if the request already in userspace, create a new reqest, but only let >>> userspace have it once the previous request for the same page >>> completes, so the ordering is not messed up >>> >>> But that's a lot of hairy code. >> Is it exactly how NFS solves similar problem? > NFS will apparently just block if there's a request outstanding and we are in > WB_SYNC_ALL mode. Which is somewhat simpler. Yes, indeed. Thanks, Maxim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org