All of lore.kernel.org
 help / color / mirror / Atom feed
From: Boaz Harrosh <boaz@plexistor.com>
To: Jerome Glisse <jglisse@redhat.com>, Boaz Harrosh <openosd@gmail.com>
Cc: "Dan Williams" <dan.j.williams@intel.com>,
	"Kent Overstreet" <kent.overstreet@gmail.com>,
	"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	linux-block@vger.kernel.org, "Linux MM" <linux-mm@kvack.org>,
	"John Hubbard" <jhubbard@nvidia.com>, "Jan Kara" <jack@suse.cz>,
	"Alexander Viro" <viro@zeniv.linux.org.uk>,
	"Johannes Thumshirn" <jthumshirn@suse.de>,
	"Christoph Hellwig" <hch@lst.de>, "Jens Axboe" <axboe@kernel.dk>,
	"Ming Lei" <ming.lei@redhat.com>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Steve French" <sfrench@samba.org>,
	linux-cifs@vger.kernel.org, "Yan Zheng" <zyan@redhat.com>,
	"Sage Weil" <sage@redhat.com>,
	"Ilya Dryomov" <idryomov@gmail.com>,
	"Alex Elder" <elder@kernel.org>,
	ceph-devel@vger.kernel.org,
	"Eric Van Hensbergen" <ericvh@gmail.com>,
	"Latchesar Ionkov" <lucho@ionkov.net>,
	"Mike Marshall" <hubcap@omnibond.com>,
	"Martin Brandenburg" <martin@omnibond.com>,
	"Dominique Martinet" <asmadeus@codewreck.org>,
	v9fs-developer@lists.sourceforge.net, "Coly Li" <colyli@suse.de>,
	linux-bcache@vger.kernel.org,
	"Ernesto A. Fernández" <ernesto.mnd.fernandez@gmail.com>
Subject: Re: [PATCH v1 00/15] Keep track of GUPed pages in fs and block
Date: Wed, 17 Apr 2019 04:11:03 +0300	[thread overview]
Message-ID: <fa00a2ff-3664-3165-7af8-9d9c53238245@plexistor.com> (raw)
In-Reply-To: <20190416231655.GB22465@redhat.com>

On 17/04/19 02:16, Jerome Glisse wrote:
> On Wed, Apr 17, 2019 at 01:09:22AM +0300, Boaz Harrosh wrote:
>> On 16/04/19 22:57, Jerome Glisse wrote:
>> <>
>>>
>>> A very long thread on this:
>>>
>>> https://lkml.org/lkml/2018/12/3/1128
>>>
>>> especialy all the reply to this first one
>>>
>>> There is also:
>>>
>>> https://lkml.org/lkml/2019/3/26/1395
>>> https://lwn.net/Articles/753027/
>>>
>>
>> OK I have re-read this patchset and a little bit of the threads above (not all)
>>
>> As I understand the long term plan is to keep two separate ref-counts one
>> for GUP-ref and one for the regular page-state/ownership ref.
>> Currently looking at page-ref we do not know if we have a GUP currently held.
>> With the new plan we can (Still not sure what's the full plan with this new info)
>>
>> But if you make it such as the first GUP-ref also takes a page_ref and the
>> last GUp-dec also does put_page. Then the all of these becomes a matter of
>> matching every call to get_user_pages or iov_iter_get_pages() with a new
>> put_user_pages or iov_iter_put_pages().
>>
>> Then if much below us an LLD takes a get_page() say an skb below the iscsi
>> driver, and so on. We do not care and we keep doing a put_page because we know
>> the GUP-ref holds the page for us.
>>
>> The current block layer is transparent to any page-ref it does not take any
>> nor put_page any. It is only the higher users that have done GUP that take care of that.
>>
>> The patterns I see are:
>>
>>   iov_iter_get_pages()
>>
>> 	IO(sync)
>>
>>   for(numpages)
>> 	put_page()
>>
>> Or
>>
>>   iov_iter_get_pages()
>>
>> 	IO (async)
>> 		->	foo_end_io()
>> 				put_page
>>
>> (Same with get_user_pages)
>> (IO need not be block layer. It can be networking and so on like in NFS or CIFS
>>  and so on)
> 
> They are also other code that pass around bio_vec and the code that
> fill it is disconnected from the code that release the page and they
> can mix and match GUP and non GUP AFAICT.
> 
> On fs side they are also code that fill either bio or bio_vec and
> use some extra mechanism other than bio_end to submit io through
> workqueue and then release pages (cifs for instance). Again i believe
> they can mix and match GUP and non GUP (i have not spotted something
> obvious indicating otherwise).
> 

But what I meant is why do we care at all? block layer does not inc page nor put
page in any of bio or bio_vec. It is agnostic to the page-refs.

Users register an end_io and know if pages are getted or not.
So the balanced put is up to the user.

>>
>> The first pattern is easy just add the proper new api for
>> it, so for every iov_iter_get_pages() you have an iov_iter_put_pages() and remove
>> lots of cooked up for loops. Also the all iov_iter_get_pages_use_gup() just drops.
>> (Same at get_user_pages sites use put_user_pages)
> 
> Yes this patchset already convert some of this first pattern.
> 

Right!

>> The second pattern is a bit harder because it is possible that the foo_end_io()
>> is currently used for GUP as well as none-GUP cases. this is easy to fix. But the
>> even harder case is if the same foo_end_io() call has some pages GUPed and some not
>> in the same call.
>>
>> staring at this patchset and the call sites I did not see any such places. Do you know
>> of any?
>> (We can always force such mixed-case users to always GUP-ref the pages and code
>>  foo_end_io() to GUP-dec)
> 
> I believe direct-io.c is such example thought in that case i believe it
> can only be the ZERO_PAGE so this might easily detectable. They are also
> lot of fs functions taking an iterator and then using iov_iter_get_pages*()
> to fill a bio. AFAICT those functions can be call with pipe iterator or
> iovec iterator and probably also with other iterator type. But it is all
> common code afterward (the bi_end_io function is the same no matter the
> iterator).
> 
> Thought that can probably be solve that way:
> 
> From:
>     foo_bi_end_io(struct bio *bio) {
>         ...
>         for (i = 0; i < npages; ++i) {
>             put_page(pages[i]);
>         }
>     }
> 
> To:
>     foo_bi_end_io_common(struct bio *bio) {
>         ...
>     }
> 
>     foo_bi_end_io_normal(struct bio *bio)
>         foo_bi_end_io_common(bio);
>         for (i = 0; i < npages; ++i) {
>             put_page(pages[i]);
>         }
>     }
> 
>     foo_bi_end_io_gup(struct bio *bio)
>         foo_bi_end_io_common(bio);
>         for (i = 0; i < npages; ++i) {
>             put_user_page(pages[i]);
>         }
>     }
> 

Yes or when foo_bi_end_io_common is more complicated, then just make it

     foo_bi_end_io_common(struct bio *bio, bool gup) {
         ...
     }

     foo_bi_end_io_normal(struct bio *bio)
	foo_bi_end_io_common(bio, false);
     }
 
     foo_bi_end_io_gup(struct bio *bio)
	foo_bi_end_io_common(bio, true);
     }

Less risky coding of code we do not know?

> Then when filling in the bio i either pick foo_bi_end_io_normal() or
> foo_bi_end_io_gup(). I am assuming that bio with different bi_end_io
> function never get merge.
> 

Exactly

> The issue is that some bio_add_page*() call site are disconnected
> from where the bio is allocated and initialized (and also where the
> bi_end_io function is set). This make it quite hard to ascertain
> that GUPed page and non GUP page can not co-exist in same bio.
> 

Two questions if they always do a put_page at end IO. Who takes a page_ref
in the not GUP case? page-cache? VFS? a mechanical get_page?

> Also in some cases it is not clear that the same iter is use to
> fill the same bio ie it might be possible that some code path fill
> the same bio from different iterator (and thus some pages might
> be coming from GUP and other not).
> 

This one is hard to believe for me. 
one iter may produce multiple iter_get_pages() and many more bios.
But the opposite?

I know, never say never. Do you know of a specific example?
I would like to stare at it.

> It would certainly seems to require more careful review from the
> maintainers of such fs. I tend to believe that putting the burden
> on the reviewer is a harder sell :)
> 

I think a couple carefully put WARN_ONs in the PUT path can
detect any leakage of refs. And help debug these cases.

> From quick glance:
>    - nilfs segment thing
>    - direct-io same bio accumulate pages over multiple call but
>      it should always be from same iterator and thus either always
>      be from GUP or non GUP. Also the ZERO_PAGE case should be easy
>      to catch.

Yes. Or we can always take a GUP-ref on the ZERO_PAGE as well

>    - fs/nfs/blocklayout/blocklayout.c

This one is an example of "please do not touch" if you look at the code
it currently does not do any put page at all. Though yes it does bio_add_page.

The pages are GETed and PUTed in nfs/direct.c and reference are balanced there.

this is the trivial case of for every iov_iter_get_pages[_alloc]() there is
a new defined iov_iter_put_pages[_alloc]()

So this is an example of extra not needed code changes in your approach

>    - gfs2 log buffer, that should never be page from GUP but i could
>      not ascertain that easily from quick review

	Same as NFS maybe? didn't look.

> 
> This is not extensive, i was just grepping for bio_add_page() and
> they are 2 other variant to check and i tended to discard places
> where bio is allocated in same function as bio_add_page() but this
> might not be a valid assumption either. Some bio might be allocated
> and only if there is no default bio already and then set as default
> bio which might be use latter on with different iterator.
> 

I think we do not care at all about any of the bio_add_page() or bio_alloc
places. All we care about is the call to iov_iter_get_pages* and where in the
code these puts are balanced.

If we need to split the endio case at those sights then we can do as above.
Or in the worse case when pages are really mixed. Always take a GUP  ref also
on the not GUP case. 
(I would like to see where this happens)

>>
>> So with a very careful coding I think you need not touch the block / scatter-list layers
>> nor any LLD drivers. The only code affected is the code around the get_user_pages and friends.
>> Changing the API will surface all those.
>> (IE. introduce a new API, convert one by one, Remove old API)
>>
>> Am I smoking?
> 
> No, i thought about it seemed more dangerous and harder to get right
> because some code add page in one place and setup bio in another. I
> can dig some more on that front but this still leave the non-bio user
> of bio_vec and those IIRC also suffer from same disconnect issue.
> 

Again I should not care about bio_vec. I only need to trace the balancing of the
ref taken in GUP call sight. Let me help you in those places it is not obvious to
you.

>>
>> BTW: Are you aware of the users of iov_iter_get_pages_alloc() Do they need fixing too?
> 
> Yeah and that patchset should address those already, i do not think
> i missed any.
> 

I could not find a patch for nfs/direct.c where a put_page is called
to balance the iov_iter_get_pages_alloc(). Which takes care of for example of
the blocklayout.c pages state

So I think the deep Audit needs to be for iov_iter_get_pages and get_user_pages
and the balancing of that. And the all of bio_alloc and bio_add_page should stay
agnostic to any pege-refs taking/putting

> Cheers,
> Jérôme
> 

Lets talk in LSF, see things hands on

Thanks
Boaz


  reply	other threads:[~2019-04-17  1:11 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-04-11 21:08 [PATCH v1 00/15] Keep track of GUPed pages in fs and block jglisse
2019-04-11 21:08 ` jglisse
2019-04-11 21:08 ` [PATCH v1 01/15] fs/direct-io: fix trailing whitespace issues jglisse
2019-04-11 21:08 ` [PATCH v1 02/15] iov_iter: add helper to test if an iter would use GUP jglisse
2019-04-11 21:08 ` [PATCH v1 03/15] block: introduce bvec_page()/bvec_set_page() to get/set bio_vec.bv_page jglisse
2019-04-11 21:08 ` [PATCH v1 04/15] block: introduce BIO_VEC_INIT() macro to initialize bio_vec structure jglisse
2019-04-11 21:08   ` jglisse
2019-04-11 21:08 ` [PATCH v1 05/15] block: replace all bio_vec->bv_page by bvec_page()/bvec_set_page() jglisse
2019-04-11 21:08 ` [PATCH v1 06/15] block: convert bio_vec.bv_page to bv_pfn to store pfn and not page jglisse
2019-04-11 21:08 ` [PATCH v1 07/15] block: add bvec_put_page_dirty*() to replace put_page(bvec_page()) jglisse
2019-04-11 21:08 ` [PATCH v1 08/15] block: use bvec_put_page() instead of put_page(bvec_page()) jglisse
2019-04-11 21:08 ` [PATCH v1 09/15] block: bvec_put_page_dirty* instead of set_page_dirty* and bvec_put_page jglisse
2019-04-11 21:08 ` [PATCH v1 10/15] block: add gup flag to bio_add_page()/bio_add_pc_page()/__bio_add_page() jglisse
2019-04-15 14:59   ` Jan Kara
2019-04-15 15:24     ` Jerome Glisse
2019-04-16 16:46       ` Jan Kara
2019-04-16 16:54         ` Dan Williams
2019-04-16 16:54           ` Dan Williams
2019-04-16 17:07         ` Jerome Glisse
2019-04-16  0:22     ` Jerome Glisse
2019-04-16 16:52       ` Jan Kara
2019-04-16 18:32         ` Jerome Glisse
2019-04-11 21:08 ` [PATCH v1 11/15] block: make sure bio_add_page*() knows page that are coming from GUP jglisse
2019-04-11 21:08 ` [PATCH v1 12/15] fs/direct-io: keep track of wether a page is coming from GUP or not jglisse
2019-04-11 23:14   ` Dave Chinner
2019-04-12  0:08     ` Jerome Glisse
2019-04-11 21:08 ` [PATCH v1 13/15] fs/splice: use put_user_page() when appropriate jglisse
2019-04-11 21:08 ` [PATCH v1 14/15] fs: use bvec_set_gup_page() where appropriate jglisse
2019-04-11 21:08 ` [PATCH v1 15/15] ceph: use put_user_pages() instead of ceph_put_page_vector() jglisse
2019-04-15  7:46   ` Yan, Zheng
2019-04-15 15:11     ` Jerome Glisse
2019-04-16  0:00 ` [PATCH v1 00/15] Keep track of GUPed pages in fs and block Dave Chinner
2019-04-16  0:00   ` Dave Chinner
2019-04-16 18:35 ` Boaz Harrosh
2019-04-16 18:47   ` Jerome Glisse
2019-04-16 18:47     ` Jerome Glisse
2019-04-16 19:14     ` Boaz Harrosh
2019-04-16 18:59   ` Kent Overstreet
2019-04-16 18:59     ` Kent Overstreet
2019-04-16 19:12     ` Dan Williams
2019-04-16 19:12       ` Dan Williams
2019-04-16 19:28       ` Boaz Harrosh
2019-04-16 19:57         ` Jerome Glisse
2019-04-16 19:57           ` Jerome Glisse
2019-04-16 22:09           ` Boaz Harrosh
2019-04-16 23:16             ` Jerome Glisse
2019-04-16 23:16               ` Jerome Glisse
2019-04-17  1:11               ` Boaz Harrosh [this message]
2019-04-17  2:03                 ` Jerome Glisse
2019-04-17  2:03                   ` Jerome Glisse
2019-04-17 21:19                   ` Jerome Glisse
2019-04-17 21:19                     ` Jerome Glisse
2019-04-16 23:34             ` Jerome Glisse
2019-04-16 23:34               ` Jerome Glisse
2019-04-17 21:54         ` Dan Williams
2019-04-17 21:54           ` Dan Williams
2019-04-18 15:56           ` Boaz Harrosh
2019-04-16 19:49       ` Jerome Glisse
2019-04-16 19:49         ` Jerome Glisse
2019-04-17 21:53         ` Dan Williams
2019-04-17 21:53           ` Dan Williams
2019-04-17 22:28           ` Jerome Glisse
2019-04-17 22:28             ` Jerome Glisse
2019-04-17 23:32             ` Dan Williams
2019-04-17 23:32               ` Dan Williams
2019-04-18 10:42             ` Jan Kara
2019-04-18 10:42               ` Jan Kara
2019-04-18 14:27               ` Jerome Glisse
2019-04-18 14:27                 ` Jerome Glisse
2019-04-18 15:30                 ` Jan Kara
2019-04-18 15:30                   ` Jan Kara
2019-04-18 15:36                   ` Jerome Glisse
2019-04-18 15:36                     ` Jerome Glisse
2019-04-18 18:03               ` Dan Williams
2019-04-18 18:03                 ` Dan Williams

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=fa00a2ff-3664-3165-7af8-9d9c53238245@plexistor.com \
    --to=boaz@plexistor.com \
    --cc=asmadeus@codewreck.org \
    --cc=axboe@kernel.dk \
    --cc=ceph-devel@vger.kernel.org \
    --cc=colyli@suse.de \
    --cc=dan.j.williams@intel.com \
    --cc=elder@kernel.org \
    --cc=ericvh@gmail.com \
    --cc=ernesto.mnd.fernandez@gmail.com \
    --cc=hch@lst.de \
    --cc=hubcap@omnibond.com \
    --cc=idryomov@gmail.com \
    --cc=jack@suse.cz \
    --cc=jgg@ziepe.ca \
    --cc=jglisse@redhat.com \
    --cc=jhubbard@nvidia.com \
    --cc=jthumshirn@suse.de \
    --cc=kent.overstreet@gmail.com \
    --cc=linux-bcache@vger.kernel.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-cifs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lucho@ionkov.net \
    --cc=martin@omnibond.com \
    --cc=ming.lei@redhat.com \
    --cc=openosd@gmail.com \
    --cc=sage@redhat.com \
    --cc=sfrench@samba.org \
    --cc=v9fs-developer@lists.sourceforge.net \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=zyan@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.