From: Jerome Glisse
To: Boaz Harrosh
Cc: Boaz Harrosh, Dan Williams, Kent Overstreet, Linux Kernel Mailing List,
    linux-fsdevel, linux-block@vger.kernel.org, Linux MM, John Hubbard,
    Jan Kara, Alexander Viro, Johannes Thumshirn, Christoph Hellwig,
    Jens Axboe, Ming Lei, Jason Gunthorpe, Matthew Wilcox, Steve French,
    linux-cifs@vger.kernel.org, Yan Zheng, Sage Weil, Ilya Dryomov,
    Alex Elder, ceph-devel@vger.kernel.org, Eric Van Hensbergen,
    Latchesar Ionkov, Mike Marshall, Martin Brandenburg, Dominique Martinet,
    v9fs-developer@lists.sourceforge.net, Coly Li, linux-bcache@vger.kernel.org,
    Ernesto A. Fernández
Subject: Re: [PATCH v1 00/15] Keep track of GUPed pages in fs and block
Date: Tue, 16 Apr 2019 22:03:46 -0400
Message-ID: <20190417020345.GB3330@redhat.com>
References: <20190411210834.4105-1-jglisse@redhat.com>
 <2c124cc4-b97e-ee28-2926-305bc6bc74bd@plexistor.com>
 <20190416185922.GA12818@kmo-pixel>
 <20190416195735.GE21526@redhat.com>
 <41e2d7e1-104b-a006-2824-015ca8c76cc8@gmail.com>
 <20190416231655.GB22465@redhat.com>

On Wed, Apr 17, 2019 at 04:11:03AM +0300, Boaz Harrosh wrote:
> On 17/04/19 02:16, Jerome Glisse wrote:
> > On Wed, Apr 17, 2019 at 01:09:22AM +0300, Boaz Harrosh wrote:
> >> On 16/04/19 22:57, Jerome Glisse wrote:
> >> <>
> >>>
> >>> A very long thread on this:
> >>>
> >>> https://lkml.org/lkml/2018/12/3/1128
> >>>
> >>> especially all the replies to that first one.
> >>>
> >>> There is also:
> >>>
> >>> https://lkml.org/lkml/2019/3/26/1395
> >>> https://lwn.net/Articles/753027/
> >>>
> >>
> >> OK I have re-read this patchset and a little bit of the threads above (not all).
> >>
> >> As I understand it, the long term plan is to keep two separate ref-counts: one
> >> for the GUP-ref and one for the regular page-state/ownership ref.
> >> Currently, looking at the page-ref we do not know if a GUP is currently held.
> >> With the new plan we can. (Still not sure what's the full plan with this new info.)
> >>
> >> But if you make it such that the first GUP-ref also takes a page_ref and the
> >> last GUP-dec also does put_page, then all of this becomes a matter of
> >> matching every call to get_user_pages or iov_iter_get_pages() with a new
> >> put_user_pages or iov_iter_put_pages().
> >>
> >> Then if much below us an LLD takes a get_page(), say an skb below the iscsi
> >> driver, and so on, we do not care and we keep doing a put_page because we know
> >> the GUP-ref holds the page for us.
> >>
> >> The current block layer is transparent to any page-ref: it does not take any
> >> nor put_page any. It is only the higher users that have done GUP that take care of that.
> >>
> >> The patterns I see are:
> >>
> >>     iov_iter_get_pages()
> >>
> >>     IO (sync)
> >>
> >>     for (numpages)
> >>         put_page()
> >>
> >> Or
> >>
> >>     iov_iter_get_pages()
> >>
> >>     IO (async)
> >>         -> foo_end_io()
> >>            put_page()
> >>
> >> (Same with get_user_pages.)
> >> (The IO need not be block layer. It can be networking and so on, like in NFS or CIFS.)
> >
> > There is also other code that passes around a bio_vec where the code that
> > fills it is disconnected from the code that releases the pages, and they
> > can mix and match GUP and non-GUP AFAICT.
> >
> > On the fs side there is also code that fills either a bio or a bio_vec and
> > uses some extra mechanism other than bio_end to submit IO through a
> > workqueue and then release the pages (cifs for instance). Again I believe
> > they can mix and match GUP and non-GUP (I have not spotted anything
> > obvious indicating otherwise).
> >
>
> But what I meant is: why do we care at all? The block layer does not get nor put
> a page in any of bio or bio_vec. It is agnostic to the page-refs.
>
> Users register an end_io and know whether their pages hold a reference or not.
> So the balanced put is up to the user.
>
> >>
> >> The first pattern is easy: just add the proper new API for
> >> it, so for every iov_iter_get_pages() you have an iov_iter_put_pages(), and remove
> >> lots of cooked-up for loops. Also the whole iov_iter_get_pages_use_gup() just drops.
> >> (Same at get_user_pages sites: use put_user_pages.)
> >
> > Yes, this patchset already converts some of this first pattern.
> >
>
> Right!
>
> >> The second pattern is a bit harder because it is possible that the foo_end_io()
> >> is currently used for GUP as well as non-GUP cases. This is easy to fix. But the
> >> even harder case is if the same foo_end_io() call has some pages GUPed and some not
> >> in the same call.
> >>
> >> Staring at this patchset and the call sites I did not see any such places. Do you know
> >> of any?
> >> (We can always force such mixed-case users to always GUP-ref the pages and code
> >> foo_end_io() to GUP-dec.)
> >
> > I believe direct-io.c is such an example, though in that case I believe it
> > can only be the ZERO_PAGE so this might be easily detectable. There are also
> > a lot of fs functions taking an iterator and then using iov_iter_get_pages*()
> > to fill a bio. AFAICT those functions can be called with a pipe iterator or
> > an iovec iterator and probably also with other iterator types. But it is all
> > common code afterward (the bi_end_io function is the same no matter the
> > iterator).
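
As an aside on the ZERO_PAGE corner: detection at completion time could look
roughly like the sketch below. The function name is made up, and put_user_page()
is only the GUP-aware helper being discussed in this thread, so treat this as an
illustration rather than code from the patchset:

    #include <linux/bio.h>
    #include <linux/mm.h>

    static void foo_dio_end_io(struct bio *bio)
    {
        struct bio_vec *bvec;
        int i;

        bio_for_each_segment_all(bvec, bio, i) {
            struct page *page = bvec->bv_page;

            /* direct-io pads partial blocks with ZERO_PAGE, which is
             * never GUPed, so it keeps taking the plain put_page path */
            if (page == ZERO_PAGE(0))
                put_page(page);
            else
                put_user_page(page);
        }
        bio_put(bio);
    }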
> >
> > Though that can probably be solved this way:
> >
> > From:
> >
> >     foo_bi_end_io(struct bio *bio) {
> >         ...
> >         for (i = 0; i < npages; ++i) {
> >             put_page(pages[i]);
> >         }
> >     }
> >
> > To:
> >
> >     foo_bi_end_io_common(struct bio *bio) {
> >         ...
> >     }
> >
> >     foo_bi_end_io_normal(struct bio *bio) {
> >         foo_bi_end_io_common(bio);
> >         for (i = 0; i < npages; ++i) {
> >             put_page(pages[i]);
> >         }
> >     }
> >
> >     foo_bi_end_io_gup(struct bio *bio) {
> >         foo_bi_end_io_common(bio);
> >         for (i = 0; i < npages; ++i) {
> >             put_user_page(pages[i]);
> >         }
> >     }
> >
>
> Yes, or when foo_bi_end_io_common is more complicated, then just make it:
>
>     foo_bi_end_io_common(struct bio *bio, bool gup) {
>         ...
>     }
>
>     foo_bi_end_io_normal(struct bio *bio) {
>         foo_bi_end_io_common(bio, false);
>     }
>
>     foo_bi_end_io_gup(struct bio *bio) {
>         foo_bi_end_io_common(bio, true);
>     }
>
> Less risky coding of code we do not know?

Yes, whatever is more appropriate for each end_io func.

>
> > Then when filling in the bio I either pick foo_bi_end_io_normal() or
> > foo_bi_end_io_gup(). I am assuming that bios with different bi_end_io
> > functions never get merged.
> >
>
> Exactly
>
> > The issue is that some bio_add_page*() call sites are disconnected
> > from where the bio is allocated and initialized (and also where the
> > bi_end_io function is set). This makes it quite hard to ascertain
> > that GUPed pages and non-GUP pages cannot co-exist in the same bio.
> >
>
> Two questions. If they always do a put_page at end IO, who takes a page_ref
> in the non-GUP case? The page-cache? The VFS? A mechanical get_page?

It depends. Some get the page from the page-cache (find_get_page), others just
get_page() a page that is coming from an unknown origin (ie it is not
clear from within the function where the page is coming from), others
allocate a page and transfer the page reference to the bio, ie the
page will get freed once the bio end function is called as it calls
put_page on the pages within the bio. So there is not one simple
story here. Hence why it looks harder to me (still likely do-able).

> > Also in some cases it is not clear that the same iter is used to
> > fill the same bio, ie it might be possible that some code paths fill
> > the same bio from different iterators (and thus some pages might
> > be coming from GUP and others not).
> >
>
> This one is hard to believe for me.
> One iter may produce multiple iter_get_pages() and many more bios.
> But the opposite?
>
> I know, never say never. Do you know of a specific example?
> I would like to stare at it.

No, I do not know of any such example, but reading through a lot of
code I could not convince myself of that fact. I agree with your
feeling on the matter, I just don't have proof. I can spend more time
digging through all the code paths and ascertaining things, but this
will inevitably be a snapshot in time and something that people might
break in the future.

> > It would certainly seem to require more careful review from the
> > maintainers of such fs. I tend to believe that putting the burden
> > on the reviewer is a harder sell :)
> >
>
> I think a couple of carefully put WARN_ONs in the PUT path can
> detect any leakage of refs. And help debug these cases.

Yes, we do plan on catching things like underflow (easy one to catch)
or a put_page that brings the refcount just below the bias value ...
and in any case the only harm that can come of this should be a memory leak.
So I believe that even if we miss some weird corner case we should be able
to catch it eventually and nothing harmful should ever happen. Famous last
words I will regret when I get burnt at the stake for missing that one case ;)
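
For illustration only, the kind of check being talked about could look like the
sketch below. GUP_BIAS and the exact accounting are assumptions made up for the
example (the premise is that get_user_pages() added GUP_BIAS to the page
refcount), not what the patchset actually implements:

    #include <linux/mm.h>
    #include <linux/page_ref.h>

    #define GUP_BIAS 1024

    static inline void put_user_page(struct page *page)
    {
        /*
         * If the refcount no longer carries the bias, either this page
         * was never GUPed or a previous put was unbalanced; warn instead
         * of silently underflowing. Worst case after the warning is a
         * leak, not a use-after-free.
         */
        WARN_ON_ONCE(page_ref_count(page) < GUP_BIAS);

        /* drop the bias, leaving one reference for put_page() to release */
        page_ref_sub(page, GUP_BIAS - 1);
        put_page(page);
    }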
> > From a quick glance:
> >   - the nilfs segment thing
> >   - direct-io: the same bio accumulates pages over multiple calls, but
> >     it should always be from the same iterator and thus either always
> >     be from GUP or non-GUP. Also the ZERO_PAGE case should be easy
> >     to catch.
>
> Yes. Or we can always take a GUP-ref on the ZERO_PAGE as well.
>
> >   - fs/nfs/blocklayout/blocklayout.c
>
> This one is an example of "please do not touch": if you look at the code
> it currently does not do any put_page at all. Though yes, it does bio_add_page.
>
> The pages are GETed and PUTed in nfs/direct.c and the references are balanced there.
>
> This is the trivial case: for every iov_iter_get_pages[_alloc]() there is
> a newly defined iov_iter_put_pages[_alloc]().
>
> So this is an example of extra, not needed code changes in your approach.
>
> >   - the gfs2 log buffer, which should never be a page from GUP, but I could
> >     not ascertain that easily from a quick review
>
> Same as NFS maybe? Didn't look.
>
> >
> > This is not exhaustive, I was just grepping for bio_add_page() and
> > there are 2 other variants to check, and I tended to discard places
> > where the bio is allocated in the same function as bio_add_page(), but this
> > might not be a valid assumption either. Some bios might be allocated
> > only if there is no default bio already, and then set as the default
> > bio which might be used later on with a different iterator.
> >
>
> I think we do not care at all about any of the bio_add_page() or bio_alloc
> places. All we care about is the call to iov_iter_get_pages* and where in the
> code these puts are balanced.
>
> If we need to split the end_io case at those sites then we can do as above.
> Or in the worst case, when pages are really mixed, always take a GUP ref also
> in the non-GUP case.
> (I would like to see where this happens.)
>
> >>
> >> So with very careful coding I think you need not touch the block / scatter-list layers
> >> nor any LLD drivers. The only code affected is the code around get_user_pages and friends.
> >> Changing the API will surface all those.
> >> (IE. introduce a new API, convert one by one, remove the old API.)
> >>
> >> Am I smoking?
> >
> > No. I thought about it, it seemed more dangerous and harder to get right
> > because some code adds pages in one place and sets up the bio in another. I
> > can dig some more on that front, but this still leaves the non-bio users
> > of bio_vec and those IIRC also suffer from the same disconnect issue.
> >
>
> Again, I should not care about bio_vec. I only need to trace the balancing of the
> ref taken at the GUP call site. Let me help you in those places where it is not
> obvious to you.
>
> >>
> >> BTW: Are you aware of the users of iov_iter_get_pages_alloc()? Do they need fixing too?
> >
> > Yeah, and this patchset should address those already, I do not think
> > I missed any.
> >
>
> I could not find a patch for nfs/direct.c where a put_page is called
> to balance the iov_iter_get_pages_alloc(). Which takes care, for example, of
> the blocklayout.c pages' state.
>
> So I think the deep audit needs to be for iov_iter_get_pages and get_user_pages
> and the balancing of that. And all of bio_alloc and bio_add_page should stay
> agnostic to any page-refs taking/putting.
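
To make the shape of that audit concrete, here is a rough before/after for the
"first pattern" above. Everything except iov_iter_get_pages_alloc(), put_page()
and kvfree() is a made-up name, and iov_iter_put_pages() is only the helper this
thread proposes, it does not exist today:

    #include <linux/kernel.h>
    #include <linux/mm.h>
    #include <linux/uio.h>

    static ssize_t foo_do_sync_io(struct iov_iter *iter, size_t maxsize)
    {
        struct page **pages;
        size_t start;
        ssize_t bytes;
        int i, npages;

        bytes = iov_iter_get_pages_alloc(iter, &pages, maxsize, &start);
        if (bytes <= 0)
            return bytes;
        npages = DIV_ROUND_UP(start + bytes, PAGE_SIZE);

        /* ... synchronous IO on pages[0..npages-1] ... */

        /* today: an open-coded loop that cannot tell GUP refs from plain refs */
        for (i = 0; i < npages; i++)
            put_page(pages[i]);

        /*
         * with the proposed API the loop would collapse into one balanced
         * call that knows how the references were taken, e.g.:
         *
         *      iov_iter_put_pages(iter, pages, npages);
         */

        kvfree(pages);
        return bytes;
    }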
Will try to get started on that and see if I hit any roadblocks.
I will report once I get my feet wet, or at least before I drown ;)

Cheers,
Jérôme