* [PATCH] fix unbalanced page refcounting in bio_map_user_iov
@ 2017-09-22  5:18 Vitaly Mayatskikh
  2017-09-22  5:24 ` Vitaly Mayatskikh
  2017-09-23 16:39 ` Al Viro
From: Vitaly Mayatskikh @ 2017-09-22  5:18 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: Jens Axboe

bio_map_user_iov and bio_unmap_user do unbalanced page refcounting if
the IO vector has small consecutive buffers belonging to the same page.
bio_add_pc_page merges them into a single bvec, but the extra page
reference taken for the merged buffer is never dropped.
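
For two adjacent 512-byte buffers sharing one page, the sequence is
roughly (gup write flag and queue limits elided):

	get_user_pages_fast(uaddr0, 1, write, &pages[0]); /* page ref +1 */
	get_user_pages_fast(uaddr1, 1, write, &pages[1]); /* same page: +1 */

	bio_add_pc_page(q, bio, pages[0], 512, off);       /* new bvec */
	bio_add_pc_page(q, bio, pages[1], 512, off + 512); /* merged, bi_vcnt unchanged */

	/* bio_unmap_user() does one put_page() per bvec, i.e. one for
	 * this page, so the second reference is never dropped */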

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>

diff --git a/block/bio.c b/block/bio.c
index b38e962fa83e..10cd3b6bed27 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1383,6 +1383,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 		offset = offset_in_page(uaddr);
 		for (j = cur_page; j < page_limit; j++) {
 			unsigned int bytes = PAGE_SIZE - offset;
+			unsigned short prev_bi_vcnt = bio->bi_vcnt;
 
 			if (len <= 0)
 				break;
@@ -1397,6 +1398,13 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 					    bytes)
 				break;
 
+			/*
+			 * check if vector was merged with previous
+			 * drop page reference if needed
+			 */
+			if (bio->bi_vcnt == prev_bi_vcnt)
+				put_page(pages[j]);
+
 			len -= bytes;
 			offset = 0;
 		}

-- 
wbr, Vitaly


* Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov
  2017-09-22  5:18 [PATCH] fix unbalanced page refcounting in bio_map_user_iov Vitaly Mayatskikh
@ 2017-09-22  5:24 ` Vitaly Mayatskikh
  2017-09-23 16:39 ` Al Viro
From: Vitaly Mayatskikh @ 2017-09-22  5:24 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: Jens Axboe

Reproducer (needs SCSI disk):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <errno.h>
#include <malloc.h>
#include <string.h>
#include <scsi/sg.h>
 
#define NR_IOS 10000
#define NR_IOVECS 8
#define SG_IO 0x2285
 
int main(int argc, char *argv[])
{
        int fd, i, j;
        unsigned char *buf, *ptr, cdb[10];
        sg_io_hdr_t io_hdr;
        sg_iovec_t iovec[NR_IOVECS];
 
        if (argc < 2) {
                printf("Run: %s </dev/sdX>\n", argv[0]);
                exit(1);
        }
 
        buf = ptr = memalign(4096, NR_IOS * NR_IOVECS * 512);
        if (!buf) {
                printf("can't alloc memory\n");
                exit(1);
        }
 
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                printf("open %s failed: %d (%s)\n", argv[1], errno, strerror(errno));
                exit(1);
        }
 
        /* io_hdr and cdb live on the stack: zero them so unused fields
         * (sbp, mx_sb_len, flags, ...) don't pass garbage to the kernel */
        memset(&io_hdr, 0, sizeof(io_hdr));
        memset(cdb, 0, sizeof(cdb));

        io_hdr.interface_id = 'S';
        io_hdr.cmd_len = sizeof(cdb);
        io_hdr.cmdp = cdb;
        io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
        io_hdr.dxfer_len = 512 * NR_IOVECS;
        io_hdr.dxferp = iovec;
        io_hdr.iovec_count = NR_IOVECS;
 
        cdb[0] = 0x28;      // READ10
        cdb[8] = NR_IOVECS; // sectors
 
        for (j = 0; j < NR_IOS; j++) {
                for (i = 0; i < NR_IOVECS; i++) {
                        /* eight consecutive 512-byte buffers covering
                         * exactly one 4K page: bio_add_pc_page merges
                         * them into a single bvec */
                        iovec[i].iov_base = ptr;
                        iovec[i].iov_len = 512;
                        ptr += 512;
                }
                if (ioctl(fd, SG_IO, &io_hdr)) {
                        printf("IOCTL failed: %d (%s)\n", errno, strerror(errno));
                        exit(1);
                }
        }
 
        free(buf);
        close(fd);
        return 0;
}

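Build with something like: gcc -o sgio-leak sgio-leak.c.  Each request's
eight consecutive 512-byte buffers cover exactly one page and get merged
into a single bvec, so every IO leaves one page pinned: a run leaks
NR_IOS pages, i.e. roughly 40 MB, which matches the free -m deltas below: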

# free -m
              total        used        free      shared  buff/cache   available
Mem:           3827          46        3601           0         178        3568
Swap:             0           0           0
# ./sgio-leak /dev/sdd
# free -m
              total        used        free      shared  buff/cache   available
Mem:           3827          85        3562           0         178        3529
Swap:             0           0           0
# free -m
              total        used        free      shared  buff/cache   available
Mem:           3827          85        3628           0         113        3561
Swap:             0           0           0
# ./sgio-leak /dev/sdd
# free -m
              total        used        free      shared  buff/cache   available
Mem:           3827         124        3589           0         113        3523
Swap:             0           0           0

-- 
wbr, Vitaly


* Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov
  2017-09-22  5:18 [PATCH] fix unbalanced page refcounting in bio_map_user_iov Vitaly Mayatskikh
  2017-09-22  5:24 ` Vitaly Mayatskikh
@ 2017-09-23 16:39 ` Al Viro
  2017-09-23 16:55   ` Al Viro
From: Al Viro @ 2017-09-23 16:39 UTC (permalink / raw)
  To: Vitaly Mayatskikh; +Cc: linux-block, linux-kernel, Jens Axboe

On Fri, Sep 22, 2017 at 01:18:39AM -0400, Vitaly Mayatskikh wrote:
> bio_map_user_iov and bio_unmap_user do unbalanced page refcounting if
> the IO vector has small consecutive buffers belonging to the same page.
> bio_add_pc_page merges them into a single bvec, but the extra page
> reference taken for the merged buffer is never dropped.
> 
> Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
> 
> diff --git a/block/bio.c b/block/bio.c
> index b38e962fa83e..10cd3b6bed27 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1383,6 +1383,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
>  		offset = offset_in_page(uaddr);
>  		for (j = cur_page; j < page_limit; j++) {
>  			unsigned int bytes = PAGE_SIZE - offset;
> +			unsigned short prev_bi_vcnt = bio->bi_vcnt;
>  
>  			if (len <= 0)
>  				break;
> @@ -1397,6 +1398,13 @@ struct bio *bio_map_user_iov(struct request_queue *q,
>  					    bytes)
>  				break;
>  
> +			/*
> +			 * check if vector was merged with previous
> +			 * drop page reference if needed
> +			 */
> +			if (bio->bi_vcnt == prev_bi_vcnt)
> +				put_page(pages[j]);
> +

Except that now you've got double-puts on failure exits ;-/
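
For reference, the failure path as it stands, roughly:

 out_unmap:
	for (j = 0; j < nr_pages; j++) {
		if (!pages[j])
			break;
		put_page(pages[j]);
	}

With the merge check above, a merged page's reference has already been
dropped once, so this loop drops it a second time.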


* Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov
  2017-09-23 16:39 ` Al Viro
@ 2017-09-23 16:55   ` Al Viro
  2017-09-23 17:19     ` Al Viro
From: Al Viro @ 2017-09-23 16:55 UTC (permalink / raw)
  To: Vitaly Mayatskikh; +Cc: linux-block, linux-kernel, Jens Axboe

On Sat, Sep 23, 2017 at 05:39:28PM +0100, Al Viro wrote:
> On Fri, Sep 22, 2017 at 01:18:39AM -0400, Vitaly Mayatskikh wrote:
> > bio_map_user_iov and bio_unmap_user do unbalanced page refcounting if
> > the IO vector has small consecutive buffers belonging to the same page.
> > bio_add_pc_page merges them into a single bvec, but the extra page
> > reference taken for the merged buffer is never dropped.
> > 
> > Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index b38e962fa83e..10cd3b6bed27 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1383,6 +1383,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
> >  		offset = offset_in_page(uaddr);
> >  		for (j = cur_page; j < page_limit; j++) {
> >  			unsigned int bytes = PAGE_SIZE - offset;
> > +			unsigned short prev_bi_vcnt = bio->bi_vcnt;
> >  
> >  			if (len <= 0)
> >  				break;
> > @@ -1397,6 +1398,13 @@ struct bio *bio_map_user_iov(struct request_queue *q,
> >  					    bytes)
> >  				break;
> >  
> > +			/*
> > +			 * check if vector was merged with previous
> > +			 * drop page reference if needed
> > +			 */
> > +			if (bio->bi_vcnt == prev_bi_vcnt)
> > +				put_page(pages[j]);
> > +
> 
> Except that now you've got double-puts on failure exits ;-/

IOW, the loop on failure exit should go through the bio, like __bio_unmap_user()
does.  We *also* need to put everything left unused in pages[], but only from the
last iteration through iov_for_each().

Frankly, I would prefer to reuse the pages[], rather than append to it on each
iteration.  Used iov_iter_get_pages_alloc(), actually.
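
For reference, the iov_iter_get_pages_alloc() calling convention (the
pattern the diff in the follow-up uses):

	struct page **pages;
	size_t offs;
	/* pins user pages covering up to maxsize bytes of the iterator,
	 * allocating the page array itself; returns the number of bytes
	 * covered, with the data starting at offs within pages[0].  The
	 * caller put_page()s each page and kvfree()s the array. */
	ssize_t bytes = iov_iter_get_pages_alloc(&i, &pages, LONG_MAX, &offs);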


* Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov
  2017-09-23 16:55   ` Al Viro
@ 2017-09-23 17:19     ` Al Viro
  2017-09-23 20:33       ` Al Viro
From: Al Viro @ 2017-09-23 17:19 UTC (permalink / raw)
  To: Vitaly Mayatskikh; +Cc: linux-block, linux-kernel, Jens Axboe

On Sat, Sep 23, 2017 at 05:55:37PM +0100, Al Viro wrote:

> IOW, the loop on failure exit should go through the bio, like __bio_unmap_user()
> does.  We *also* need to put everything left unused in pages[], but only from the
> last iteration through iov_for_each().
> 
> Frankly, I would prefer to reuse the pages[], rather than append to it on each
> iteration.  Used iov_iter_get_pages_alloc(), actually.

Something like completely untested diff below, perhaps...

diff --git a/block/bio.c b/block/bio.c
index b38e962fa83e..b5fe23597b41 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1323,94 +1323,60 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 			     const struct iov_iter *iter,
 			     gfp_t gfp_mask)
 {
-	int j;
-	int nr_pages = 0;
-	struct page **pages;
 	struct bio *bio;
-	int cur_page = 0;
-	int ret, offset;
+	struct bio_vec *bvec;
 	struct iov_iter i;
-	struct iovec iov;
-
-	iov_for_each(iov, i, *iter) {
-		unsigned long uaddr = (unsigned long) iov.iov_base;
-		unsigned long len = iov.iov_len;
-		unsigned long end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
-		unsigned long start = uaddr >> PAGE_SHIFT;
-
-		/*
-		 * Overflow, abort
-		 */
-		if (end < start)
-			return ERR_PTR(-EINVAL);
-
-		nr_pages += end - start;
-		/*
-		 * buffer must be aligned to at least logical block size for now
-		 */
-		if (uaddr & queue_dma_alignment(q))
-			return ERR_PTR(-EINVAL);
-	}
+	int ret, j;
 
-	if (!nr_pages)
+	if (!iov_iter_count(iter))
 		return ERR_PTR(-EINVAL);
 
-	bio = bio_kmalloc(gfp_mask, nr_pages);
+	bio = bio_kmalloc(gfp_mask, iov_iter_npages(iter, BIO_MAX_PAGES));
 	if (!bio)
 		return ERR_PTR(-ENOMEM);
 
-	ret = -ENOMEM;
-	pages = kcalloc(nr_pages, sizeof(struct page *), gfp_mask);
-	if (!pages)
-		goto out;
+	i = *iter;
+	while (iov_iter_count(&i)) {
+		struct page **pages;
+		size_t offs;
+		ssize_t bytes;
+		int npages, j;
 
-	iov_for_each(iov, i, *iter) {
-		unsigned long uaddr = (unsigned long) iov.iov_base;
-		unsigned long len = iov.iov_len;
-		unsigned long end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
-		unsigned long start = uaddr >> PAGE_SHIFT;
-		const int local_nr_pages = end - start;
-		const int page_limit = cur_page + local_nr_pages;
-
-		ret = get_user_pages_fast(uaddr, local_nr_pages,
-				(iter->type & WRITE) != WRITE,
-				&pages[cur_page]);
-		if (ret < local_nr_pages) {
-			ret = -EFAULT;
+		bytes = iov_iter_get_pages_alloc(&i, &pages, LONG_MAX, &offs);
+		if (bytes <= 0) {
+			ret = bytes ? bytes : -EFAULT;
 			goto out_unmap;
 		}
 
-		offset = offset_in_page(uaddr);
-		for (j = cur_page; j < page_limit; j++) {
-			unsigned int bytes = PAGE_SIZE - offset;
+		npages = DIV_ROUND_UP(offs + bytes, PAGE_SIZE);
 
-			if (len <= 0)
-				break;
+		if (unlikely(offs & queue_dma_alignment(q))) {
+			ret = -EINVAL;
+			j = 0;
+		} else {
+			for (j = 0; j < npages; j++) {
+				struct page *page = pages[j];
+				unsigned n = PAGE_SIZE - offs;
+				unsigned prev_bi_vcnt = bio->bi_vcnt;
+
+				if (!bio_add_pc_page(q, bio, page, n, offs)) {
+					iov_iter_truncate(&i, 0);
+					break;
+				}
+
+				if (bio->bi_vcnt == prev_bi_vcnt)
+					put_page(page);
 			
-			if (bytes > len)
-				bytes = len;
-
-			/*
-			 * sorry...
-			 */
-			if (bio_add_pc_page(q, bio, pages[j], bytes, offset) <
-					    bytes)
-				break;
-
-			len -= bytes;
-			offset = 0;
+				iov_iter_advance(&i, n);
+				bytes -= n;
+				offs = 0;
+			}
 		}
-
-		cur_page = j;
-		/*
-		 * release the pages we didn't map into the bio, if any
-		 */
-		while (j < page_limit)
+		while (j < npages)
 			put_page(pages[j++]);
+		kvfree(pages);
 	}
 
-	kfree(pages);
-
 	bio_set_flag(bio, BIO_USER_MAPPED);
 
 	/*
@@ -1423,13 +1389,9 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 	return bio;
 
  out_unmap:
-	for (j = 0; j < nr_pages; j++) {
-		if (!pages[j])
-			break;
-		put_page(pages[j]);
+	bio_for_each_segment_all(bvec, bio, j) {
+		put_page(bvec->bv_page);
 	}
- out:
-	kfree(pages);
 	bio_put(bio);
 	return ERR_PTR(ret);
 }


* Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov
  2017-09-23 17:19     ` Al Viro
@ 2017-09-23 20:33       ` Al Viro
  2017-09-24 14:27         ` Al Viro
From: Al Viro @ 2017-09-23 20:33 UTC (permalink / raw)
  To: Vitaly Mayatskikh; +Cc: linux-block, linux-kernel, Jens Axboe

On Sat, Sep 23, 2017 at 06:19:26PM +0100, Al Viro wrote:
> On Sat, Sep 23, 2017 at 05:55:37PM +0100, Al Viro wrote:
> 
> > IOW, the loop on failure exit should go through the bio, like __bio_unmap_user()
> > does.  We *also* need to put everything left unused in pages[], but only from the
> > last iteration through iov_for_each().
> > 
> > Frankly, I would prefer to reuse the pages[], rather than append to it on each
> > iteration.  Used iov_iter_get_pages_alloc(), actually.
> 
> Something like completely untested diff below, perhaps...

> +				unsigned n = PAGE_SIZE - offs;
> +				unsigned prev_bi_vcnt = bio->bi_vcnt;

Sorry, that should've been followed by
				if (n > bytes)
					n = bytes;
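
i.e. the inner loop from the previous diff with the clamp folded in
(still untested):

	for (j = 0; j < npages; j++) {
		struct page *page = pages[j];
		unsigned n = PAGE_SIZE - offs;
		unsigned prev_bi_vcnt = bio->bi_vcnt;

		if (n > bytes)
			n = bytes;

		if (!bio_add_pc_page(q, bio, page, n, offs)) {
			iov_iter_truncate(&i, 0);
			break;
		}

		/* merged into the previous bvec: drop the extra page ref */
		if (bio->bi_vcnt == prev_bi_vcnt)
			put_page(page);

		iov_iter_advance(&i, n);
		bytes -= n;
		offs = 0;
	}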

Anyway, a carved-up variant is in vfs.git#work.iov_iter.  It still needs
review and testing; it's the patch Vitaly has posted in this thread plus
6 followups, hopefully more readable than the aggregate diff.

Comments?


* Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov
  2017-09-23 20:33       ` Al Viro
@ 2017-09-24 14:27         ` Al Viro
  2017-09-24 17:15           ` Al Viro
  2017-09-25  1:48           ` Vitaly Mayatskikh
From: Al Viro @ 2017-09-24 14:27 UTC (permalink / raw)
  To: linux-block
  Cc: linux-kernel, Jens Axboe, Christoph Hellwig, Vitaly Mayatskikh

On Sat, Sep 23, 2017 at 09:33:23PM +0100, Al Viro wrote:
> On Sat, Sep 23, 2017 at 06:19:26PM +0100, Al Viro wrote:
> > On Sat, Sep 23, 2017 at 05:55:37PM +0100, Al Viro wrote:
> > 
> > > IOW, the loop on failure exit should go through the bio, like __bio_unmap_user()
> > > does.  We *also* need to put everything left unused in pages[], but only from the
> > > last iteration through iov_for_each().
> > > 
> > > Frankly, I would prefer to reuse the pages[], rather than append to it on each
> > > iteration.  Used iov_iter_get_pages_alloc(), actually.
> > 
> > Something like completely untested diff below, perhaps...
> 
> > +				unsigned n = PAGE_SIZE - offs;
> > +				unsigned prev_bi_vcnt = bio->bi_vcnt;
> 
> Sorry, that should've been followed by
> 				if (n > bytes)
> 					n = bytes;
> 
> Anyway, a carved-up variant is in vfs.git#work.iov_iter.  It still needs
> review and testing; it's the patch Vitaly has posted in this thread plus
> 6 followups, hopefully more readable than the aggregate diff.
> 
> Comments?

BTW, there's something fishy in bio_copy_user_iov().  If the area we'd asked for
had been too large for a single bio, we are going to create a bio and have
bio_add_pc_page() eventually fill it up to the limit.  Then we return into
__blk_rq_map_user_iov(), advance iter and call bio_copy_user_iov() again.
Fine, but... now we might have non-zero iter->iov_offset.  And this
        bmd->is_our_pages = map_data ? 0 : 1;
        memcpy(bmd->iov, iter->iov, sizeof(struct iovec) * iter->nr_segs);
        iov_iter_init(&bmd->iter, iter->type, bmd->iov,
                        iter->nr_segs, iter->count);
does not even look at iter->iov_offset.  As a result, when it gets to
bio_uncopy_user(), we copy the data from each bio into the *beginning* of
the user area, overwriting that from the other bio.

At the very least, we need bmd->iter = *iter; bmd->iter.iov = bmd->iov;
instead of that iov_iter_init() in there.  I'm not sure how far back it
goes; looks like "block: support large requests in blk_rq_map_user_iov"
is the earliest possible point, but it might need more digging to make
sure.  v4.5+, if that's when the problems began...
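
That is, something like (sketch, untested):

	bmd->is_our_pages = map_data ? 0 : 1;
	memcpy(bmd->iov, iter->iov, sizeof(struct iovec) * iter->nr_segs);
	/* take the whole iterator state, iov_offset included... */
	bmd->iter = *iter;
	/* ...but point it at bmd's own copy of the iovec array */
	bmd->iter.iov = bmd->iov;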

Anyway, I've added the obvious fix to #work.iov_iter, reordered it and
force-pushed the result.


* Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov
  2017-09-24 14:27         ` Al Viro
@ 2017-09-24 17:15           ` Al Viro
  2017-09-25  1:48           ` Vitaly Mayatskikh
From: Al Viro @ 2017-09-24 17:15 UTC (permalink / raw)
  To: linux-block
  Cc: linux-kernel, Jens Axboe, Christoph Hellwig, Vitaly Mayatskikh

On Sun, Sep 24, 2017 at 03:27:39PM +0100, Al Viro wrote:

> At the very least, we need bmd->iter = *iter; bmd->iter.iov = bmd->iov;
> instead of that iov_iter_init() in there.  I'm not sure how far back it
> goes; looks like "block: support large requests in blk_rq_map_user_iov"
> is the earliest possible point, but it might need more digging to make
> sure.  v4.5+, if that's when the problems began...
> 
> Anyway, I've added the obvious fix to #work.iov_iter, reordered it and
> force-pushed the result.

While we are at it, the calculation of nr_pages in bio_copy_user_iov() is
bloody odd - why, in the name of everything unholy, does it care about the
iovec boundaries in there?  We are copying the data anyway; why does the
allocation of the bio care about the fragmentation of the other end of the
copy?  Shouldn't it be simply min(DIV_ROUND_UP(offset + len, PAGE_SIZE),
BIO_MAX_PAGES)?


* Re: [PATCH] fix unbalanced page refcounting in bio_map_user_iov
  2017-09-24 14:27         ` Al Viro
  2017-09-24 17:15           ` Al Viro
@ 2017-09-25  1:48           ` Vitaly Mayatskikh
From: Vitaly Mayatskikh @ 2017-09-25  1:48 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-block, linux-kernel, Jens Axboe, Christoph Hellwig

On Sun, 24 Sep 2017 10:27:39 -0400,
Al Viro wrote:

> BTW, there's something fishy in bio_copy_user_iov().  If the area we'd asked for
> had been too large for a single bio, we are going to create a bio and have
> bio_add_pc_page() eventually fill it up to the limit.  Then we return into
> __blk_rq_map_user_iov(), advance iter and call bio_copy_user_iov() again.
> Fine, but... now we might have non-zero iter->iov_offset.  And this
>         bmd->is_our_pages = map_data ? 0 : 1;
>         memcpy(bmd->iov, iter->iov, sizeof(struct iovec) * iter->nr_segs);
>         iov_iter_init(&bmd->iter, iter->type, bmd->iov,
>                         iter->nr_segs, iter->count);
> does not even look at iter->iov_offset.  As a result, when it gets to
> bio_uncopy_user(), we copy the data from each bio into the *beginning* of
> the user area, overwriting that from the other bio.

Yeah, something is wrong with bio_copy_user_iov.  Our datapath hangs when
IO flows through unmodified SG (it forces bio_copy if iov_count is set).
I did not look at the details, but the same IO pattern and memory layout
work well with bio_map (modulo the refcount problem).

> Anyway, I've added the obvious fix to #work.iov_iter, reordered it and
> force-pushed the result.

I'll give it a try, thanks!
-- 
wbr, Vitaly

