* Crash when IO is being submitted and block size is changed
@ 2012-06-28  3:04 Mikulas Patocka
  2012-06-28 11:15 ` Jan Kara
  2012-06-29  6:25 ` Crash when IO is being submitted and block size is changed Vyacheslav Dubeyko
  0 siblings, 2 replies; 45+ messages in thread
From: Mikulas Patocka @ 2012-06-28  3:04 UTC (permalink / raw)
  To: Alexander Viro, Jens Axboe, Alasdair G. Kergon
  Cc: linux-kernel, linux-fsdevel, linux-mm, dm-devel

Hi

The kernel crashes when IO is being submitted to a block device and block 
size of that device is changed simultaneously.

To reproduce the crash, apply this patch:

--- linux-3.4.3-fast.orig/fs/block_dev.c 2012-06-27 20:24:07.000000000 +0200
+++ linux-3.4.3-fast/fs/block_dev.c 2012-06-27 20:28:34.000000000 +0200
@@ -28,6 +28,7 @@
 #include <linux/log2.h>
 #include <linux/cleancache.h>
 #include <asm/uaccess.h> 
+#include <linux/delay.h>
 #include "internal.h"
 struct bdev_inode {
@@ -203,6 +204,7 @@ blkdev_get_blocks(struct inode *inode, s
 
 	bh->b_bdev = I_BDEV(inode);
 	bh->b_blocknr = iblock;
+	msleep(1000);
 	bh->b_size = max_blocks << inode->i_blkbits;
 	if (max_blocks)
 		set_buffer_mapped(bh);

Use some device with 4k blocksize, for example a ramdisk.
Run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct"
While it is sleeping in the msleep function, run "blockdev --setbsz 2048 
/dev/ram0" on the other console.
You get a BUG at fs/direct-io.c:1013 - BUG_ON(this_chunk_bytes == 0);
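
To see why the BUG can fire, here is a simplified sketch of the
arithmetic (an illustration only, assuming the 3.4-era direct-io code:
the dio code latches the block-size shift when the request is set up,
while blkdev_get_blocks re-reads inode->i_blkbits later):

	unsigned dio_blkbits = 12;	/* latched when dd started (4k) */
	unsigned new_blkbits = 11;	/* after "blockdev --setbsz 2048" */
	unsigned max_blocks = 1;

	/* blkdev_get_blocks sizes the mapping with the new shift */
	size_t b_size = max_blocks << new_blkbits;		/* 2048 */

	/* the dio core converts back using the old, latched shift */
	unsigned blocks_available = b_size >> dio_blkbits;	/* 0! */
	unsigned this_chunk_bytes = blocks_available << dio_blkbits;
	BUG_ON(this_chunk_bytes == 0);	/* fires */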


One may ask "why would anyone do this - submit I/O and change block size 
simultaneously?" - the problem is that udev and lvm can scan and read all 
block devices anytime - so anytime you change block device size, there may 
be some i/o to that device in flight and the crash may happen. That BUG 
actually happened in production environment because of lvm scanning block 
devices and some other software changing block size at the same time.


I would like to know your opinion on fixing this crash. There are 
several possibilities:

* we could potentially read i_blkbits once, store it in the direct i/o 
structure and never read it again - direct i/o could perhaps be modified 
to do this (it reads i_blkbits only in a few places; a sketch follows 
after this list). But what about non-direct i/o? Non-direct i/o reads 
i_blkbits much more often and the code was obviously written without 
considering that it may change - for block devices, i_blkbits is 
essentially a random value that can change anytime you read it, and the 
code of block_read_full_page, __block_write_begin, 
__block_write_full_page and others doesn't seem to take that into account.

* put some rw-lock around all I/Os on the block device. The rw-lock 
would be taken for read on all I/O paths and it would be taken for write 
when changing the block size. The downside would be a possible 
performance hit of the rw-lock. The rw-lock could be made per-cpu to 
avoid cache line bouncing (take the rw-lock belonging to the current cpu 
for read; for write take all cpus' locks).

* allow changing block size only if the device is open only once and the 
process is single-threaded? (so there couldn't be any outstanding I/Os). I 
don't know if this could be tested reliably... Another question: what to 
do if the device is open multiple times?
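
As a minimal sketch of the first option (illustrative only - the actual
direct-io structure and call chain may differ), the idea is to snapshot
the shift once and pass only the snapshot down:

	/* at submission time, read i_blkbits exactly once */
	unsigned blkbits = inode->i_blkbits;

	/* ... and pass only the snapshot down, so that e.g. the
	 * get_blocks callback uses it instead of re-reading the
	 * live value: */
	bh->b_size = max_blocks << blkbits;	/* not inode->i_blkbits */
	if (max_blocks)
		set_buffer_mapped(bh);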

Do you have any other ideas what to do with it?

Mikulas


* Re: Crash when IO is being submitted and block size is changed
  2012-06-28  3:04 Crash when IO is being submitted and block size is changed Mikulas Patocka
@ 2012-06-28 11:15 ` Jan Kara
  2012-06-28 15:44   ` Mikulas Patocka
  2012-07-16  0:55   ` Mikulas Patocka
  2012-06-29  6:25 ` Crash when IO is being submitted and block size is changed Vyacheslav Dubeyko
  1 sibling, 2 replies; 45+ messages in thread
From: Jan Kara @ 2012-06-28 11:15 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alexander Viro, Jens Axboe, Alasdair G. Kergon, linux-kernel,
	linux-fsdevel, linux-mm, dm-devel

On Wed 27-06-12 23:04:09, Mikulas Patocka wrote:
> The kernel crashes when IO is being submitted to a block device and block 
> size of that device is changed simultaneously.
  Nasty ;-)

> To reproduce the crash, apply this patch:
> 
> --- linux-3.4.3-fast.orig/fs/block_dev.c 2012-06-27 20:24:07.000000000 +0200
> +++ linux-3.4.3-fast/fs/block_dev.c 2012-06-27 20:28:34.000000000 +0200
> @@ -28,6 +28,7 @@
>  #include <linux/log2.h>
>  #include <linux/cleancache.h>
>  #include <asm/uaccess.h> 
> +#include <linux/delay.h>
>  #include "internal.h"
>  struct bdev_inode {
> @@ -203,6 +204,7 @@ blkdev_get_blocks(struct inode *inode, s
>  
>  	bh->b_bdev = I_BDEV(inode);
>  	bh->b_blocknr = iblock;
> +	msleep(1000);
>  	bh->b_size = max_blocks << inode->i_blkbits;
>  	if (max_blocks)
>  		set_buffer_mapped(bh);
> 
> Use some device with 4k blocksize, for example a ramdisk.
> Run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct"
> While it is sleeping in the msleep function, run "blockdev --setbsz 2048 
> /dev/ram0" on the other console.
> You get a BUG at fs/direct-io.c:1013 - BUG_ON(this_chunk_bytes == 0);
> 
> 
> One may ask "why would anyone do this - submit I/O and change the block
> size simultaneously?" - the problem is that udev and lvm can scan and
> read all block devices at any time, so whenever you change a block
> device's block size, there may be some i/o to that device in flight and
> the crash may happen. That BUG actually happened in a production
> environment because of lvm scanning block devices while some other
> software changed the block size at the same time.
> 
> 
> I would like to know your opinion on fixing this crash. There are 
> several possibilities:
> 
> * we could potentially read i_blkbits once, store it in the direct i/o 
> structure and never read it again - direct i/o could perhaps be modified 
> to do this (it reads i_blkbits only in a few places). But what about 
> non-direct i/o? Non-direct i/o reads i_blkbits much more often and the 
> code was obviously written without considering that it may change - for 
> block devices, i_blkbits is essentially a random value that can change 
> anytime you read it, and the code of block_read_full_page, 
> __block_write_begin, __block_write_full_page and others doesn't seem to 
> take that into account.
> 
> * put some rw-lock around all I/Os on the block device. The rw-lock 
> would be taken for read on all I/O paths and it would be taken for write 
> when changing the block size. The downside would be a possible 
> performance hit of the rw-lock. The rw-lock could be made per-cpu to 
> avoid cache line bouncing (take the rw-lock belonging to the current cpu 
> for read; for write take all cpus' locks).
> 
> * allow changing block size only if the device is open only once and the 
> process is single-threaded? (so there couldn't be any outstanding I/Os). I 
> don't know if this could be tested reliably... Another question: what to 
> do if the device is open multiple times?
> 
> Do you have any other ideas what to do with it?
  Yeah, it's nasty and neither solution looks particularly appealing. One
idea that came to my mind is: I'm trying to solve some races between direct
IO, buffered IO, hole punching etc. by a new mapping interval lock. I'm not
sure if it will go anywhere yet but if it does, we can fix the above race
by taking the mapping lock for the whole block device around setting block
size, thus effectively disallowing any IO to it.
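
Purely to illustrate the idea (the interval-lock API below is
hypothetical, not an existing kernel interface):

	/* set_blocksize() would take the mapping lock over the whole
	 * device, excluding any I/O against the bdev mapping: */
	mapping_lock_range(bdev->bd_inode->i_mapping, 0, LLONG_MAX);
	bdev->bd_block_size = size;
	bdev->bd_inode->i_blkbits = blksize_bits(size);
	mapping_unlock_range(bdev->bd_inode->i_mapping, 0, LLONG_MAX);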

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: Crash when IO is being submitted and block size is changed
  2012-06-28 11:15 ` Jan Kara
@ 2012-06-28 15:44   ` Mikulas Patocka
  2012-06-28 16:53     ` Jan Kara
  2012-07-16  0:55   ` Mikulas Patocka
  1 sibling, 1 reply; 45+ messages in thread
From: Mikulas Patocka @ 2012-06-28 15:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Alexander Viro, Jens Axboe, Alasdair G. Kergon, linux-kernel,
	linux-fsdevel, dm-devel



On Thu, 28 Jun 2012, Jan Kara wrote:

> > Do you have any other ideas what to do with it?
>   Yeah, it's nasty and neither solution looks particularly appealing. One
> idea that came to my mind is: I'm trying to solve some races between direct
> IO, buffered IO, hole punching etc. by a new mapping interval lock. I'm not
> sure if it will go anywhere yet but if it does, we can fix the above race
> by taking the mapping lock for the whole block device around setting block
> size, thus effectively disallowing any IO to it.
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

What races are you trying to solve? There used to be i_alloc_sem, which 
prevented direct i/o while the file was being truncated, but it 
disappeared in recent kernels...

Mikulas


* Re: Crash when IO is being submitted and block size is changed
  2012-06-28 15:44   ` Mikulas Patocka
@ 2012-06-28 16:53     ` Jan Kara
  0 siblings, 0 replies; 45+ messages in thread
From: Jan Kara @ 2012-06-28 16:53 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jan Kara, Alexander Viro, Jens Axboe, Alasdair G. Kergon,
	linux-kernel, linux-fsdevel, dm-devel

On Thu 28-06-12 11:44:03, Mikulas Patocka wrote:
> On Thu, 28 Jun 2012, Jan Kara wrote:
> 
> > > Do you have any other ideas what to do with it?
> >   Yeah, it's nasty and neither solution looks particularly appealing. One
> > idea that came to my mind is: I'm trying to solve some races between direct
> > IO, buffered IO, hole punching etc. by a new mapping interval lock. I'm not
> > sure if it will go anywhere yet but if it does, we can fix the above race
> > by taking the mapping lock for the whole block device around setting block
> > size, thus effectively disallowing any IO to it.
> > 
> > 								Honza
> > -- 
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
> 
> What races are you trying to solve? There used to be i_alloc_sem, which 
> prevented direct i/o while the file was being truncated, but it 
> disappeared in recent kernels...
  i_alloc_sem has been replaced by inode_dio_wait() and friends. The
problem is mainly with hole punching - see the thread starting here:
http://www.spinics.net/lists/linux-ext4/msg32059.html
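
For reference, the pattern that replaced i_alloc_sem looks roughly like
this (a sketch of the 3.x VFS convention around i_dio_count, not code
from this thread):

	/* truncate-like paths wait for in-flight direct I/O: */
	inode_dio_wait(inode);	/* sleeps until i_dio_count drops to 0 */
	/* ... truncate / punch safely ... */

	/* while the direct I/O code brackets each request: */
	atomic_inc(&inode->i_dio_count);
	/* ... submit and complete the dio ... */
	inode_dio_done(inode);	/* wakes up inode_dio_wait() */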

								Honza


* Re: Crash when IO is being submitted and block size is changed
  2012-06-28  3:04 Crash when IO is being submitted and block size is changed Mikulas Patocka
  2012-06-28 11:15 ` Jan Kara
@ 2012-06-29  6:25 ` Vyacheslav Dubeyko
  1 sibling, 0 replies; 45+ messages in thread
From: Vyacheslav Dubeyko @ 2012-06-29  6:25 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Alexander Viro, Jens Axboe, Alasdair G. Kergon, linux-kernel,
	linux-fsdevel, linux-mm, dm-devel, Vyacheslav.Dubeyko

Hi,

I have a simple idea; maybe it amounts to nothing. What about the
physical sector size? A block size can be measured either in bytes or in
sectors, and it can never become smaller than the physical sector size,
no matter how it is changed. Thus every block always consists of a whole
number of sectors. And if submitted I/O were processed on a sector
basis, it wouldn't matter how the block size was changed.

With the best regards,
Vyacheslav Dubeyko. 

On Wed, 2012-06-27 at 23:04 -0400, Mikulas Patocka wrote:
> Hi
> 
> The kernel crashes when IO is being submitted to a block device and block 
> size of that device is changed simultaneously.
> 
> To reproduce the crash, apply this patch:
> 
> --- linux-3.4.3-fast.orig/fs/block_dev.c 2012-06-27 20:24:07.000000000 +0200
> +++ linux-3.4.3-fast/fs/block_dev.c 2012-06-27 20:28:34.000000000 +0200
> @@ -28,6 +28,7 @@
>  #include <linux/log2.h>
>  #include <linux/cleancache.h>
>  #include <asm/uaccess.h> 
> +#include <linux/delay.h>
>  #include "internal.h"
>  struct bdev_inode {
> @@ -203,6 +204,7 @@ blkdev_get_blocks(struct inode *inode, s
>  
>  	bh->b_bdev = I_BDEV(inode);
>  	bh->b_blocknr = iblock;
> +	msleep(1000);
>  	bh->b_size = max_blocks << inode->i_blkbits;
>  	if (max_blocks)
>  		set_buffer_mapped(bh);
> 
> Use some device with 4k blocksize, for example a ramdisk.
> Run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct"
> While it is sleeping in the msleep function, run "blockdev --setbsz 2048 
> /dev/ram0" on the other console.
> You get a BUG at fs/direct-io.c:1013 - BUG_ON(this_chunk_bytes == 0);
> 
> 
> One may ask "why would anyone do this - submit I/O and change the block
> size simultaneously?" - the problem is that udev and lvm can scan and
> read all block devices at any time, so whenever you change a block
> device's block size, there may be some i/o to that device in flight and
> the crash may happen. That BUG actually happened in a production
> environment because of lvm scanning block devices while some other
> software changed the block size at the same time.
> 
> 
> I would like to know your opinion on fixing this crash. There are 
> several possibilities:
> 
> * we could potentially read i_blkbits once, store it in the direct i/o 
> structure and never read it again - direct i/o could perhaps be modified 
> to do this (it reads i_blkbits only in a few places). But what about 
> non-direct i/o? Non-direct i/o reads i_blkbits much more often and the 
> code was obviously written without considering that it may change - for 
> block devices, i_blkbits is essentially a random value that can change 
> anytime you read it, and the code of block_read_full_page, 
> __block_write_begin, __block_write_full_page and others doesn't seem to 
> take that into account.
> 
> * put some rw-lock around all I/Os on the block device. The rw-lock 
> would be taken for read on all I/O paths and it would be taken for write 
> when changing the block size. The downside would be a possible 
> performance hit of the rw-lock. The rw-lock could be made per-cpu to 
> avoid cache line bouncing (take the rw-lock belonging to the current cpu 
> for read; for write take all cpus' locks).
> 
> * allow changing block size only if the device is open only once and the 
> process is single-threaded? (so there couldn't be any outstanding I/Os). I 
> don't know if this could be tested reliably... Another question: what to 
> do if the device is open multiple times?
> 
> Do you have any other ideas what to do with it?
> 
> Mikulas




* Re: Crash when IO is being submitted and block size is changed
  2012-06-28 11:15 ` Jan Kara
  2012-06-28 15:44   ` Mikulas Patocka
@ 2012-07-16  0:55   ` Mikulas Patocka
  2012-07-17 19:19     ` Jeff Moyer
  1 sibling, 1 reply; 45+ messages in thread
From: Mikulas Patocka @ 2012-07-16  0:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: Alexander Viro, Jens Axboe, Alasdair G. Kergon, linux-kernel,
	linux-fsdevel, linux-mm, dm-devel



On Thu, 28 Jun 2012, Jan Kara wrote:

> On Wed 27-06-12 23:04:09, Mikulas Patocka wrote:
> > The kernel crashes when IO is being submitted to a block device and block 
> > size of that device is changed simultaneously.
>   Nasty ;-)
> 
> > To reproduce the crash, apply this patch:
> > 
> > --- linux-3.4.3-fast.orig/fs/block_dev.c 2012-06-27 20:24:07.000000000 +0200
> > +++ linux-3.4.3-fast/fs/block_dev.c 2012-06-27 20:28:34.000000000 +0200
> > @@ -28,6 +28,7 @@
> >  #include <linux/log2.h>
> >  #include <linux/cleancache.h>
> >  #include <asm/uaccess.h> 
> > +#include <linux/delay.h>
> >  #include "internal.h"
> >  struct bdev_inode {
> > @@ -203,6 +204,7 @@ blkdev_get_blocks(struct inode *inode, s
> >  
> >  	bh->b_bdev = I_BDEV(inode);
> >  	bh->b_blocknr = iblock;
> > +	msleep(1000);
> >  	bh->b_size = max_blocks << inode->i_blkbits;
> >  	if (max_blocks)
> >  		set_buffer_mapped(bh);
> > 
> > Use some device with 4k blocksize, for example a ramdisk.
> > Run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct"
> > While it is sleeping in the msleep function, run "blockdev --setbsz 2048 
> > /dev/ram0" on the other console.
> > You get a BUG at fs/direct-io.c:1013 - BUG_ON(this_chunk_bytes == 0);
> > 
> > 
> > One may ask "why would anyone do this - submit I/O and change the block
> > size simultaneously?" - the problem is that udev and lvm can scan and
> > read all block devices at any time, so whenever you change a block
> > device's block size, there may be some i/o to that device in flight and
> > the crash may happen. That BUG actually happened in a production
> > environment because of lvm scanning block devices while some other
> > software changed the block size at the same time.
> > 
>   Yeah, it's nasty and neither solution looks particularly appealing. One
> idea that came to my mind is: I'm trying to solve some races between direct
> IO, buffered IO, hole punching etc. by a new mapping interval lock. I'm not
> sure if it will go anywhere yet but if it does, we can fix the above race
> by taking the mapping lock for the whole block device around setting block
> size, thus effectively disallowing any IO to it.
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> 

Hi

This is the patch that fixes this crash: it takes a rw-semaphore around 
the whole direct-IO path.

(note that if someone is concerned about performance, the rw-semaphore 
could be made per-cpu --- take it for read on the current CPU and take it 
for write on all CPUs).

Mikulas

---

blockdev: fix a crash when block size is changed and I/O is issued simultaneously

The kernel may crash when block size is changed and I/O is issued
simultaneously.

Because some subsystems (udev or lvm) may read any block device at any
time, the bug actually puts any code that changes a block device's block
size in jeopardy.

The crash can be reproduced if you place "msleep(1000)" into
blkdev_get_blocks just before "bh->b_size = max_blocks <<
inode->i_blkbits;".
Then, run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct".
While it is waiting in msleep, run "blockdev --setbsz 2048 /dev/ram0".
You get a BUG.

The direct and non-direct I/O code is written with the assumption that
the block size does not change. It doesn't seem practical to fix these
crashes one-by-one: there may be many possible crashes when the block
size changes at a certain place, and it is impossible to find them all
and verify the code.

This patch introduces a new rw-lock bd_block_size_semaphore. The lock is
taken for read during I/O. It is taken for write when changing block
size. Consequently, block size can't be changed while I/O is being
submitted.

For asynchronous I/O, the patch only prevents block size change while
the I/O is being submitted. The block size can change when the I/O is in
progress or when the I/O is being finished. This is acceptable because
there are no accesses to block size when asynchronous I/O is being
finished.

The patch prevents block size changing while the device is mapped with
mmap.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/char/raw.c |    2 -
 fs/block_dev.c     |   58 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/fs.h |    4 +++
 3 files changed, 61 insertions(+), 3 deletions(-)

Index: linux-3.5-rc6-devel/include/linux/fs.h
===================================================================
--- linux-3.5-rc6-devel.orig/include/linux/fs.h	2012-07-16 01:18:45.000000000 +0200
+++ linux-3.5-rc6-devel/include/linux/fs.h	2012-07-16 01:29:21.000000000 +0200
@@ -713,6 +713,8 @@ struct block_device {
 	int			bd_fsfreeze_count;
 	/* Mutex for freeze */
 	struct mutex		bd_fsfreeze_mutex;
+	/* A semaphore that prevents I/O while block size is being changed */
+	struct rw_semaphore	bd_block_size_semaphore;
 };
 
 /*
@@ -2414,6 +2416,8 @@ extern int generic_segment_checks(const 
 		unsigned long *nr_segs, size_t *count, int access_flags);
 
 /* fs/block_dev.c */
+extern ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			       unsigned long nr_segs, loff_t pos);
 extern ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
 				unsigned long nr_segs, loff_t pos);
 extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
Index: linux-3.5-rc6-devel/fs/block_dev.c
===================================================================
--- linux-3.5-rc6-devel.orig/fs/block_dev.c	2012-07-16 01:14:33.000000000 +0200
+++ linux-3.5-rc6-devel/fs/block_dev.c	2012-07-16 01:37:28.000000000 +0200
@@ -124,6 +124,20 @@ int set_blocksize(struct block_device *b
 	if (size < bdev_logical_block_size(bdev))
 		return -EINVAL;
 
+	/* Prevent starting I/O or mapping the device */
+	down_write(&bdev->bd_block_size_semaphore);
+
+	/* Check that the block device is not memory mapped */
+	mapping = bdev->bd_inode->i_mapping;
+	mutex_lock(&mapping->i_mmap_mutex);
+	if (!prio_tree_empty(&mapping->i_mmap) ||
+	    !list_empty(&mapping->i_mmap_nonlinear)) {
+		mutex_unlock(&mapping->i_mmap_mutex);
+		up_write(&bdev->bd_block_size_semaphore);
+		return -EBUSY;
+	}
+	mutex_unlock(&mapping->i_mmap_mutex);
+
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
 		sync_blockdev(bdev);
@@ -131,6 +145,9 @@ int set_blocksize(struct block_device *b
 		bdev->bd_inode->i_blkbits = blksize_bits(size);
 		kill_bdev(bdev);
 	}
+
+	up_write(&bdev->bd_block_size_semaphore);
+
 	return 0;
 }
 
@@ -472,6 +489,7 @@ static void init_once(void *foo)
 	inode_init_once(&ei->vfs_inode);
 	/* Initialize mutex for freeze. */
 	mutex_init(&bdev->bd_fsfreeze_mutex);
+	init_rwsem(&bdev->bd_block_size_semaphore);
 }
 
 static inline void __bd_forget(struct inode *inode)
@@ -1567,6 +1585,22 @@ static long block_ioctl(struct file *fil
 	return blkdev_ioctl(bdev, mode, cmd, arg);
 }
 
+ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			unsigned long nr_segs, loff_t pos)
+{
+	ssize_t ret;
+	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
+
+	down_read(&bdev->bd_block_size_semaphore);
+
+	ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
+
+	up_read(&bdev->bd_block_size_semaphore);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_aio_read);
+
 /*
  * Write data to the block device.  Only intended for the block device itself
  * and the raw driver which basically is a fake block device.
@@ -1578,10 +1612,13 @@ ssize_t blkdev_aio_write(struct kiocb *i
 			 unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
+	struct block_device *bdev = I_BDEV(file->f_mapping->host);
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
 
+	down_read(&bdev->bd_block_size_semaphore);
+
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	if (ret > 0 || ret == -EIOCBQUEUED) {
 		ssize_t err;
@@ -1590,10 +1627,27 @@ ssize_t blkdev_aio_write(struct kiocb *i
 		if (err < 0 && ret > 0)
 			ret = err;
 	}
+
+	up_read(&bdev->bd_block_size_semaphore);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_aio_write);
 
+int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	int ret;
+	struct block_device *bdev = I_BDEV(file->f_mapping->host);
+
+	down_read(&bdev->bd_block_size_semaphore);
+
+	ret = generic_file_mmap(file, vma);
+
+	up_read(&bdev->bd_block_size_semaphore);
+
+	return ret;
+}
+
 /*
  * Try to release a page associated with block device when the system
  * is under memory pressure.
@@ -1624,9 +1678,9 @@ const struct file_operations def_blk_fop
 	.llseek		= block_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
-  	.aio_read	= generic_file_aio_read,
+  	.aio_read	= blkdev_aio_read,
 	.aio_write	= blkdev_aio_write,
-	.mmap		= generic_file_mmap,
+	.mmap		= blkdev_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
 #ifdef CONFIG_COMPAT
Index: linux-3.5-rc6-devel/drivers/char/raw.c
===================================================================
--- linux-3.5-rc6-devel.orig/drivers/char/raw.c	2012-07-16 01:29:27.000000000 +0200
+++ linux-3.5-rc6-devel/drivers/char/raw.c	2012-07-16 01:30:04.000000000 +0200
@@ -285,7 +285,7 @@ static long raw_ctl_compat_ioctl(struct 
 
 static const struct file_operations raw_fops = {
 	.read		= do_sync_read,
-	.aio_read	= generic_file_aio_read,
+	.aio_read	= blkdev_aio_read,
 	.write		= do_sync_write,
 	.aio_write	= blkdev_aio_write,
 	.fsync		= blkdev_fsync,


* Re: Crash when IO is being submitted and block size is changed
  2012-07-16  0:55   ` Mikulas Patocka
@ 2012-07-17 19:19     ` Jeff Moyer
  2012-07-19  2:27       ` Mikulas Patocka
  0 siblings, 1 reply; 45+ messages in thread
From: Jeff Moyer @ 2012-07-17 19:19 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jan Kara, Alexander Viro, Jens Axboe, Alasdair G. Kergon,
	linux-kernel, linux-fsdevel, linux-mm, dm-devel, lwoodman,
	Andrea Arcangeli, kosaki.motohiro

Mikulas Patocka <mpatocka@redhat.com> writes:

> On Thu, 28 Jun 2012, Jan Kara wrote:
>
>> On Wed 27-06-12 23:04:09, Mikulas Patocka wrote:
>> > The kernel crashes when IO is being submitted to a block device and block 
>> > size of that device is changed simultaneously.
>>   Nasty ;-)
>> 
>> > To reproduce the crash, apply this patch:
>> > 
>> > --- linux-3.4.3-fast.orig/fs/block_dev.c 2012-06-27 20:24:07.000000000 +0200
>> > +++ linux-3.4.3-fast/fs/block_dev.c 2012-06-27 20:28:34.000000000 +0200
>> > @@ -28,6 +28,7 @@
>> >  #include <linux/log2.h>
>> >  #include <linux/cleancache.h>
>> >  #include <asm/uaccess.h> 
>> > +#include <linux/delay.h>
>> >  #include "internal.h"
>> >  struct bdev_inode {
>> > @@ -203,6 +204,7 @@ blkdev_get_blocks(struct inode *inode, s
>> >  
>> >  	bh->b_bdev = I_BDEV(inode);
>> >  	bh->b_blocknr = iblock;
>> > +	msleep(1000);
>> >  	bh->b_size = max_blocks << inode->i_blkbits;
>> >  	if (max_blocks)
>> >  		set_buffer_mapped(bh);
>> > 
>> > Use some device with 4k blocksize, for example a ramdisk.
>> > Run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct"
>> > While it is sleeping in the msleep function, run "blockdev --setbsz 2048 
>> > /dev/ram0" on the other console.
>> > You get a BUG at fs/direct-io.c:1013 - BUG_ON(this_chunk_bytes == 0);
>> > 
>> > 
>> > One may ask "why would anyone do this - submit I/O and change the block
>> > size simultaneously?" - the problem is that udev and lvm can scan and
>> > read all block devices at any time, so whenever you change a block
>> > device's block size, there may be some i/o to that device in flight and
>> > the crash may happen. That BUG actually happened in a production
>> > environment because of lvm scanning block devices while some other
>> > software changed the block size at the same time.
>> > 
>>   Yeah, it's nasty and neither solution looks particularly appealing. One
>> idea that came to my mind is: I'm trying to solve some races between direct
>> IO, buffered IO, hole punching etc. by a new mapping interval lock. I'm not
>> sure if it will go anywhere yet but if it does, we can fix the above race
>> by taking the mapping lock for the whole block device around setting block
>> size, thus effectively disallowing any IO to it.
>> 
>> 								Honza
>> -- 
>> Jan Kara <jack@suse.cz>
>> SUSE Labs, CR
>> 
>
> Hi
>
> This is the patch that fixes this crash: it takes a rw-semaphore around 
> the whole direct-IO path.
>
> (note that if someone is concerned about performance, the rw-semaphore 
> could be made per-cpu --- take it for read on the current CPU and take it 
> for write on all CPUs).

Here we go again.  :-)  I believe we had at one point tried taking a rw
semaphore around GUP inside of the direct I/O code path to fix the fork
vs. GUP race (that still exists today).  When testing that, the overhead
of the semaphore was *way* too high to be considered an acceptable
solution.  I've CC'd Larry Woodman, Andrea, and Kosaki Motohiro who all
worked on that particular bug.  Hopefully they can give better
quantification of the slowdown than my poor memory.

Cheers,
Jeff


> blockdev: fix a crash when block size is changed and I/O is issued simultaneously
>
> The kernel may crash when block size is changed and I/O is issued
> simultaneously.
>
> Because some subsystems (udev or lvm) may read any block device at any
> time, the bug actually puts any code that changes a block device's block
> size in jeopardy.
>
> The crash can be reproduced if you place "msleep(1000)" into
> blkdev_get_blocks just before "bh->b_size = max_blocks <<
> inode->i_blkbits;".
> Then, run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct".
> While it is waiting in msleep, run "blockdev --setbsz 2048 /dev/ram0".
> You get a BUG.
>
> The direct and non-direct I/O code is written with the assumption that
> the block size does not change. It doesn't seem practical to fix these
> crashes one-by-one: there may be many possible crashes when the block
> size changes at a certain place, and it is impossible to find them all
> and verify the code.
>
> This patch introduces a new rw-lock bd_block_size_semaphore. The lock is
> taken for read during I/O. It is taken for write when changing block
> size. Consequently, block size can't be changed while I/O is being
> submitted.
>
> For asynchronous I/O, the patch only prevents block size change while
> the I/O is being submitted. The block size can change when the I/O is in
> progress or when the I/O is being finished. This is acceptable because
> there are no accesses to block size when asynchronous I/O is being
> finished.
>
> The patch prevents block size changing while the device is mapped with
> mmap.
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>
> ---
>  drivers/char/raw.c |    2 -
>  fs/block_dev.c     |   58 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>  include/linux/fs.h |    4 +++
>  3 files changed, 61 insertions(+), 3 deletions(-)
>
> Index: linux-3.5-rc6-devel/include/linux/fs.h
> ===================================================================
> --- linux-3.5-rc6-devel.orig/include/linux/fs.h	2012-07-16 01:18:45.000000000 +0200
> +++ linux-3.5-rc6-devel/include/linux/fs.h	2012-07-16 01:29:21.000000000 +0200
> @@ -713,6 +713,8 @@ struct block_device {
>  	int			bd_fsfreeze_count;
>  	/* Mutex for freeze */
>  	struct mutex		bd_fsfreeze_mutex;
> +	/* A semaphore that prevents I/O while block size is being changed */
> +	struct rw_semaphore	bd_block_size_semaphore;
>  };
>  
>  /*
> @@ -2414,6 +2416,8 @@ extern int generic_segment_checks(const 
>  		unsigned long *nr_segs, size_t *count, int access_flags);
>  
>  /* fs/block_dev.c */
> +extern ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
> +			       unsigned long nr_segs, loff_t pos);
>  extern ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
>  				unsigned long nr_segs, loff_t pos);
>  extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
> Index: linux-3.5-rc6-devel/fs/block_dev.c
> ===================================================================
> --- linux-3.5-rc6-devel.orig/fs/block_dev.c	2012-07-16 01:14:33.000000000 +0200
> +++ linux-3.5-rc6-devel/fs/block_dev.c	2012-07-16 01:37:28.000000000 +0200
> @@ -124,6 +124,20 @@ int set_blocksize(struct block_device *b
>  	if (size < bdev_logical_block_size(bdev))
>  		return -EINVAL;
>  
> +	/* Prevent starting I/O or mapping the device */
> +	down_write(&bdev->bd_block_size_semaphore);
> +
> +	/* Check that the block device is not memory mapped */
> +	mapping = bdev->bd_inode->i_mapping;
> +	mutex_lock(&mapping->i_mmap_mutex);
> +	if (!prio_tree_empty(&mapping->i_mmap) ||
> +	    !list_empty(&mapping->i_mmap_nonlinear)) {
> +		mutex_unlock(&mapping->i_mmap_mutex);
> +		up_write(&bdev->bd_block_size_semaphore);
> +		return -EBUSY;
> +	}
> +	mutex_unlock(&mapping->i_mmap_mutex);
> +
>  	/* Don't change the size if it is same as current */
>  	if (bdev->bd_block_size != size) {
>  		sync_blockdev(bdev);
> @@ -131,6 +145,9 @@ int set_blocksize(struct block_device *b
>  		bdev->bd_inode->i_blkbits = blksize_bits(size);
>  		kill_bdev(bdev);
>  	}
> +
> +	up_write(&bdev->bd_block_size_semaphore);
> +
>  	return 0;
>  }
>  
> @@ -472,6 +489,7 @@ static void init_once(void *foo)
>  	inode_init_once(&ei->vfs_inode);
>  	/* Initialize mutex for freeze. */
>  	mutex_init(&bdev->bd_fsfreeze_mutex);
> +	init_rwsem(&bdev->bd_block_size_semaphore);
>  }
>  
>  static inline void __bd_forget(struct inode *inode)
> @@ -1567,6 +1585,22 @@ static long block_ioctl(struct file *fil
>  	return blkdev_ioctl(bdev, mode, cmd, arg);
>  }
>  
> +ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
> +			unsigned long nr_segs, loff_t pos)
> +{
> +	ssize_t ret;
> +	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
> +
> +	down_read(&bdev->bd_block_size_semaphore);
> +
> +	ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
> +
> +	up_read(&bdev->bd_block_size_semaphore);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(blkdev_aio_read);
> +
>  /*
>   * Write data to the block device.  Only intended for the block device itself
>   * and the raw driver which basically is a fake block device.
> @@ -1578,10 +1612,13 @@ ssize_t blkdev_aio_write(struct kiocb *i
>  			 unsigned long nr_segs, loff_t pos)
>  {
>  	struct file *file = iocb->ki_filp;
> +	struct block_device *bdev = I_BDEV(file->f_mapping->host);
>  	ssize_t ret;
>  
>  	BUG_ON(iocb->ki_pos != pos);
>  
> +	down_read(&bdev->bd_block_size_semaphore);
> +
>  	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
>  	if (ret > 0 || ret == -EIOCBQUEUED) {
>  		ssize_t err;
> @@ -1590,10 +1627,27 @@ ssize_t blkdev_aio_write(struct kiocb *i
>  		if (err < 0 && ret > 0)
>  			ret = err;
>  	}
> +
> +	up_read(&bdev->bd_block_size_semaphore);
> +
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(blkdev_aio_write);
>  
> +int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	int ret;
> +	struct block_device *bdev = I_BDEV(file->f_mapping->host);
> +
> +	down_read(&bdev->bd_block_size_semaphore);
> +
> +	ret = generic_file_mmap(file, vma);
> +
> +	up_read(&bdev->bd_block_size_semaphore);
> +
> +	return ret;
> +}
> +
>  /*
>   * Try to release a page associated with block device when the system
>   * is under memory pressure.
> @@ -1624,9 +1678,9 @@ const struct file_operations def_blk_fop
>  	.llseek		= block_llseek,
>  	.read		= do_sync_read,
>  	.write		= do_sync_write,
> -  	.aio_read	= generic_file_aio_read,
> +  	.aio_read	= blkdev_aio_read,
>  	.aio_write	= blkdev_aio_write,
> -	.mmap		= generic_file_mmap,
> +	.mmap		= blkdev_mmap,
>  	.fsync		= blkdev_fsync,
>  	.unlocked_ioctl	= block_ioctl,
>  #ifdef CONFIG_COMPAT
> Index: linux-3.5-rc6-devel/drivers/char/raw.c
> ===================================================================
> --- linux-3.5-rc6-devel.orig/drivers/char/raw.c	2012-07-16 01:29:27.000000000 +0200
> +++ linux-3.5-rc6-devel/drivers/char/raw.c	2012-07-16 01:30:04.000000000 +0200
> @@ -285,7 +285,7 @@ static long raw_ctl_compat_ioctl(struct 
>  
>  static const struct file_operations raw_fops = {
>  	.read		= do_sync_read,
> -	.aio_read	= generic_file_aio_read,
> +	.aio_read	= blkdev_aio_read,
>  	.write		= do_sync_write,
>  	.aio_write	= blkdev_aio_write,
>  	.fsync		= blkdev_fsync,


* Re: Crash when IO is being submitted and block size is changed
  2012-07-17 19:19     ` Jeff Moyer
@ 2012-07-19  2:27       ` Mikulas Patocka
  2012-07-19 13:33         ` Jeff Moyer
  0 siblings, 1 reply; 45+ messages in thread
From: Mikulas Patocka @ 2012-07-19  2:27 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, Alexander Viro, Jens Axboe, Alasdair G. Kergon,
	linux-kernel, linux-fsdevel, dm-devel, lwoodman,
	Andrea Arcangeli, kosaki.motohiro



On Tue, 17 Jul 2012, Jeff Moyer wrote:

> Mikulas Patocka <mpatocka@redhat.com> writes:
> 
> > On Thu, 28 Jun 2012, Jan Kara wrote:
> >
> >> On Wed 27-06-12 23:04:09, Mikulas Patocka wrote:
> >> > The kernel crashes when IO is being submitted to a block device and block 
> >> > size of that device is changed simultaneously.
> >>   Nasty ;-)
> >> 
> >> > To reproduce the crash, apply this patch:
> >> > 
> >> > --- linux-3.4.3-fast.orig/fs/block_dev.c 2012-06-27 20:24:07.000000000 +0200
> >> > +++ linux-3.4.3-fast/fs/block_dev.c 2012-06-27 20:28:34.000000000 +0200
> >> > @@ -28,6 +28,7 @@
> >> >  #include <linux/log2.h>
> >> >  #include <linux/cleancache.h>
> >> >  #include <asm/uaccess.h> 
> >> > +#include <linux/delay.h>
> >> >  #include "internal.h"
> >> >  struct bdev_inode {
> >> > @@ -203,6 +204,7 @@ blkdev_get_blocks(struct inode *inode, s
> >> >  
> >> >  	bh->b_bdev = I_BDEV(inode);
> >> >  	bh->b_blocknr = iblock;
> >> > +	msleep(1000);
> >> >  	bh->b_size = max_blocks << inode->i_blkbits;
> >> >  	if (max_blocks)
> >> >  		set_buffer_mapped(bh);
> >> > 
> >> > Use some device with 4k blocksize, for example a ramdisk.
> >> > Run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct"
> >> > While it is sleeping in the msleep function, run "blockdev --setbsz 2048 
> >> > /dev/ram0" on the other console.
> >> > You get a BUG at fs/direct-io.c:1013 - BUG_ON(this_chunk_bytes == 0);
> >> > 
> >> > 
> >> > One may ask "why would anyone do this - submit I/O and change the block
> >> > size simultaneously?" - the problem is that udev and lvm can scan and
> >> > read all block devices at any time, so whenever you change a block
> >> > device's block size, there may be some i/o to that device in flight and
> >> > the crash may happen. That BUG actually happened in a production
> >> > environment because of lvm scanning block devices while some other
> >> > software changed the block size at the same time.
> >> > 
> >>   Yeah, it's nasty and neither solution looks particularly appealing. One
> >> idea that came to my mind is: I'm trying to solve some races between direct
> >> IO, buffered IO, hole punching etc. by a new mapping interval lock. I'm not
> >> sure if it will go anywhere yet but if it does, we can fix the above race
> >> by taking the mapping lock for the whole block device around setting block
> >> size, thus effectively disallowing any IO to it.
> >> 
> >> 								Honza
> >> -- 
> >> Jan Kara <jack@suse.cz>
> >> SUSE Labs, CR
> >> 
> >
> > Hi
> >
> > This is the patch that fixes this crash: it takes a rw-semaphore around 
> > the whole direct-IO path.
> >
> > (note that if someone is concerned about performance, the rw-semaphore 
> > could be made per-cpu --- take it for read on the current CPU and take it 
> > for write on all CPUs).
> 
> Here we go again.  :-)  I believe we had at one point tried taking a rw
> semaphore around GUP inside of the direct I/O code path to fix the fork
> vs. GUP race (that still exists today).  When testing that, the overhead
> of the semaphore was *way* too high to be considered an acceptable
> solution.  I've CC'd Larry Woodman, Andrea, and Kosaki Motohiro who all
> worked on that particular bug.  Hopefully they can give better
> quantification of the slowdown than my poor memory.
> 
> Cheers,
> Jeff

Both down_read and up_read together take 82 ticks on Core2, 69 ticks on 
AMD K10, 62 ticks on UltraSparc2 if the target is in L1 cache. So, if 
percpu rw_semaphores were used, it would slow down only by this amount.
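
The measurement could be reproduced with a loop of roughly this shape (a
sketch assuming get_cycles() from asm/timex.h; this is not code from the
original measurement):

	static DECLARE_RWSEM(sem);	/* target rwsem, hot in L1 */

	cycles_t t1, t2;

	down_read(&sem); up_read(&sem);	/* warm up the cache line */
	t1 = get_cycles();
	down_read(&sem);
	up_read(&sem);
	t2 = get_cycles();
	printk("down_read+up_read: %llu ticks\n",
	       (unsigned long long)(t2 - t1));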

I hope that Linux developers are not so obsessed with performance that 
they want a fast crashing kernel rather than a slow reliable kernel. Note 
that anything that changes a device's block size (for example, mounting a 
filesystem with a non-default block size) may trigger a crash if lvm or 
udev reads the device simultaneously; the crash really happened in a 
business environment.

Mikulas


* Re: Crash when IO is being submitted and block size is changed
  2012-07-19  2:27       ` Mikulas Patocka
@ 2012-07-19 13:33         ` Jeff Moyer
  2012-07-28 16:40           ` [PATCH 1/3] Fix " Mikulas Patocka
  0 siblings, 1 reply; 45+ messages in thread
From: Jeff Moyer @ 2012-07-19 13:33 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jan Kara, Alexander Viro, Jens Axboe, Alasdair G. Kergon,
	linux-kernel, linux-fsdevel, dm-devel, lwoodman,
	Andrea Arcangeli, kosaki.motohiro

Mikulas Patocka <mpatocka@redhat.com> writes:

> On Tue, 17 Jul 2012, Jeff Moyer wrote:
>

>> > This is the patch that fixes this crash: it takes a rw-semaphore around 
>> > the whole direct-IO path.
>> >
>> > (note that if someone is concerned about performance, the rw-semaphore 
>> > could be made per-cpu --- take it for read on the current CPU and take it 
>> > for write on all CPUs).
>> 
>> Here we go again.  :-)  I believe we had at one point tried taking a rw
>> semaphore around GUP inside of the direct I/O code path to fix the fork
>> vs. GUP race (that still exists today).  When testing that, the overhead
>> of the semaphore was *way* too high to be considered an acceptable
>> solution.  I've CC'd Larry Woodman, Andrea, and Kosaki Motohiro who all
>> worked on that particular bug.  Hopefully they can give better
>> quantification of the slowdown than my poor memory.
>> 
>> Cheers,
>> Jeff
>
> Both down_read and up_read together take 82 ticks on Core2, 69 ticks on 
> AMD K10, 62 ticks on UltraSparc2 if the target is in L1 cache. So, if 
> percpu rw_semaphores were used, it would slow down only by this amount.

Sorry, I'm not familiar with per-cpu rw semaphores.  Where are they
implemented?

> I hope that Linux developers are not so obsessed with performance that
> they want a fast crashing kernel rather than a slow reliable kernel.
> Note that anything that changes a device's block size (for example,
> mounting a filesystem with a non-default block size) may trigger a crash
> if lvm or udev reads the device simultaneously; the crash really
> happened in a business environment.

I wasn't suggesting that we leave the problem unfixed (though I can see
how you might have gotten that idea, sorry for not being more clear).  I
was merely suggesting that we should try to fix the problem in a way
that does not kill performance.

Cheers,
Jeff


* [PATCH 1/3] Fix Crash when IO is being submitted and block size is changed
  2012-07-19 13:33         ` Jeff Moyer
@ 2012-07-28 16:40           ` Mikulas Patocka
  2012-07-28 16:41             ` [PATCH 2/3] Introduce percpu rw semaphores Mikulas Patocka
  0 siblings, 1 reply; 45+ messages in thread
From: Mikulas Patocka @ 2012-07-28 16:40 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, Alexander Viro, Jens Axboe, Alasdair G. Kergon,
	linux-kernel, linux-fsdevel, dm-devel, lwoodman,
	Andrea Arcangeli, kosaki.motohiro



On Thu, 19 Jul 2012, Jeff Moyer wrote:

> Mikulas Patocka <mpatocka@redhat.com> writes:
> 
> > On Tue, 17 Jul 2012, Jeff Moyer wrote:
> >
> 
> >> > This is the patch that fixes this crash: it takes a rw-semaphore around 
> >> > the whole direct-IO path.
> >> >
> >> > (note that if someone is concerned about performance, the rw-semaphore 
> >> > could be made per-cpu --- take it for read on the current CPU and take it 
> >> > for write on all CPUs).
> >> 
> >> Here we go again.  :-)  I believe we had at one point tried taking a rw
> >> semaphore around GUP inside of the direct I/O code path to fix the fork
> >> vs. GUP race (that still exists today).  When testing that, the overhead
> >> of the semaphore was *way* too high to be considered an acceptable
> >> solution.  I've CC'd Larry Woodman, Andrea, and Kosaki Motohiro who all
> >> worked on that particular bug.  Hopefully they can give better
> >> quantification of the slowdown than my poor memory.
> >> 
> >> Cheers,
> >> Jeff
> >
> > Both down_read and up_read together take 82 ticks on Core2, 69 ticks on 
> > AMD K10, 62 ticks on UltraSparc2 if the target is in L1 cache. So, if 
> > percpu rw_semaphores were used, it would slow down only by this amount.
> 
> Sorry, I'm not familiar with per-cpu rw semaphores.  Where are they
> implemented?

Here I'm resending the upstream patches with percpu rw-semaphores - the 
percpu rw-semaphore itself is implemented in the next patch.

(For Jeff: you can use the patch for RHEL-6 that you did for performance 
testing, with the change that I proposed.)

Mikulas

---

blockdev: fix a crash when block size is changed and I/O is issued simultaneously

The kernel may crash when block size is changed and I/O is issued
simultaneously.

Because some subsystems (udev or lvm) may read any block device at any
time, the bug actually puts any code that changes a block device's block
size in jeopardy.

The crash can be reproduced if you place "msleep(1000)" into
blkdev_get_blocks just before "bh->b_size = max_blocks <<
inode->i_blkbits;".
Then, run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct".
While it is waiting in msleep, run "blockdev --setbsz 2048 /dev/ram0".
You get a BUG.

The direct and non-direct I/O code is written with the assumption that
the block size does not change. It doesn't seem practical to fix these
crashes one-by-one: there may be many possible crashes when the block
size changes at a certain place, and it is impossible to find them all
and verify the code.

This patch introduces a new rw-lock bd_block_size_semaphore. The lock is
taken for read during I/O. It is taken for write when changing block
size. Consequently, block size can't be changed while I/O is being
submitted.

For asynchronous I/O, the patch only prevents block size change while
the I/O is being submitted. The block size can change when the I/O is in
progress or when the I/O is being finished. This is acceptable because
there are no accesses to block size when asynchronous I/O is being
finished.

The patch prevents block size changing while the device is mapped with
mmap.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/char/raw.c |    2 -
 fs/block_dev.c     |   60 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/fs.h |    4 +++
 3 files changed, 63 insertions(+), 3 deletions(-)

Index: linux-3.5-rc6-devel/include/linux/fs.h
===================================================================
--- linux-3.5-rc6-devel.orig/include/linux/fs.h	2012-07-16 20:20:12.000000000 +0200
+++ linux-3.5-rc6-devel/include/linux/fs.h	2012-07-16 01:29:21.000000000 +0200
@@ -713,6 +713,8 @@ struct block_device {
 	int			bd_fsfreeze_count;
 	/* Mutex for freeze */
 	struct mutex		bd_fsfreeze_mutex;
+	/* A semaphore that prevents I/O while block size is being changed */
+	struct rw_semaphore	bd_block_size_semaphore;
 };
 
 /*
@@ -2414,6 +2416,8 @@ extern int generic_segment_checks(const 
 		unsigned long *nr_segs, size_t *count, int access_flags);
 
 /* fs/block_dev.c */
+extern ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			       unsigned long nr_segs, loff_t pos);
 extern ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
 				unsigned long nr_segs, loff_t pos);
 extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
Index: linux-3.5-rc6-devel/fs/block_dev.c
===================================================================
--- linux-3.5-rc6-devel.orig/fs/block_dev.c	2012-07-16 20:20:12.000000000 +0200
+++ linux-3.5-rc6-devel/fs/block_dev.c	2012-07-16 21:47:30.000000000 +0200
@@ -116,6 +116,8 @@ EXPORT_SYMBOL(invalidate_bdev);
 
 int set_blocksize(struct block_device *bdev, int size)
 {
+	struct address_space *mapping;
+
 	/* Size must be a power of two, and between 512 and PAGE_SIZE */
 	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
 		return -EINVAL;
@@ -124,6 +126,20 @@ int set_blocksize(struct block_device *b
 	if (size < bdev_logical_block_size(bdev))
 		return -EINVAL;
 
+	/* Prevent starting I/O or mapping the device */
+	down_write(&bdev->bd_block_size_semaphore);
+
+	/* Check that the block device is not memory mapped */
+	mapping = bdev->bd_inode->i_mapping;
+	mutex_lock(&mapping->i_mmap_mutex);
+	if (!prio_tree_empty(&mapping->i_mmap) ||
+	    !list_empty(&mapping->i_mmap_nonlinear)) {
+		mutex_unlock(&mapping->i_mmap_mutex);
+		up_write(&bdev->bd_block_size_semaphore);
+		return -EBUSY;
+	}
+	mutex_unlock(&mapping->i_mmap_mutex);
+
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
 		sync_blockdev(bdev);
@@ -131,6 +147,9 @@ int set_blocksize(struct block_device *b
 		bdev->bd_inode->i_blkbits = blksize_bits(size);
 		kill_bdev(bdev);
 	}
+
+	up_write(&bdev->bd_block_size_semaphore);
+
 	return 0;
 }
 
@@ -472,6 +491,7 @@ static void init_once(void *foo)
 	inode_init_once(&ei->vfs_inode);
 	/* Initialize mutex for freeze. */
 	mutex_init(&bdev->bd_fsfreeze_mutex);
+	init_rwsem(&bdev->bd_block_size_semaphore);
 }
 
 static inline void __bd_forget(struct inode *inode)
@@ -1567,6 +1587,22 @@ static long block_ioctl(struct file *fil
 	return blkdev_ioctl(bdev, mode, cmd, arg);
 }
 
+ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			unsigned long nr_segs, loff_t pos)
+{
+	ssize_t ret;
+	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
+
+	down_read(&bdev->bd_block_size_semaphore);
+
+	ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
+
+	up_read(&bdev->bd_block_size_semaphore);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_aio_read);
+
 /*
  * Write data to the block device.  Only intended for the block device itself
  * and the raw driver which basically is a fake block device.
@@ -1578,10 +1614,13 @@ ssize_t blkdev_aio_write(struct kiocb *i
 			 unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
+	struct block_device *bdev = I_BDEV(file->f_mapping->host);
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
 
+	down_read(&bdev->bd_block_size_semaphore);
+
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	if (ret > 0 || ret == -EIOCBQUEUED) {
 		ssize_t err;
@@ -1590,10 +1629,27 @@ ssize_t blkdev_aio_write(struct kiocb *i
 		if (err < 0 && ret > 0)
 			ret = err;
 	}
+
+	up_read(&bdev->bd_block_size_semaphore);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_aio_write);
 
+int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	int ret;
+	struct block_device *bdev = I_BDEV(file->f_mapping->host);
+
+	down_read(&bdev->bd_block_size_semaphore);
+
+	ret = generic_file_mmap(file, vma);
+
+	up_read(&bdev->bd_block_size_semaphore);
+
+	return ret;
+}
+
 /*
  * Try to release a page associated with block device when the system
  * is under memory pressure.
@@ -1624,9 +1680,9 @@ const struct file_operations def_blk_fop
 	.llseek		= block_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
-  	.aio_read	= generic_file_aio_read,
+  	.aio_read	= blkdev_aio_read,
 	.aio_write	= blkdev_aio_write,
-	.mmap		= generic_file_mmap,
+	.mmap		= blkdev_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
 #ifdef CONFIG_COMPAT
Index: linux-3.5-rc6-devel/drivers/char/raw.c
===================================================================
--- linux-3.5-rc6-devel.orig/drivers/char/raw.c	2012-07-16 20:20:12.000000000 +0200
+++ linux-3.5-rc6-devel/drivers/char/raw.c	2012-07-16 01:30:04.000000000 +0200
@@ -285,7 +285,7 @@ static long raw_ctl_compat_ioctl(struct 
 
 static const struct file_operations raw_fops = {
 	.read		= do_sync_read,
-	.aio_read	= generic_file_aio_read,
+	.aio_read	= blkdev_aio_read,
 	.write		= do_sync_write,
 	.aio_write	= blkdev_aio_write,
 	.fsync		= blkdev_fsync,


* [PATCH 2/3] Introduce percpu rw semaphores
  2012-07-28 16:40           ` [PATCH 1/3] Fix " Mikulas Patocka
@ 2012-07-28 16:41             ` Mikulas Patocka
  2012-07-28 16:42               ` [PATCH 3/3] blockdev: turn a rw semaphore into a percpu rw semaphore Mikulas Patocka
  2012-07-28 20:44               ` [PATCH 2/3] Introduce percpu rw semaphores Eric Dumazet
  0 siblings, 2 replies; 45+ messages in thread
From: Mikulas Patocka @ 2012-07-28 16:41 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, Alexander Viro, Jens Axboe, Alasdair G. Kergon,
	linux-kernel, linux-fsdevel, dm-devel, lwoodman,
	Andrea Arcangeli, kosaki.motohiro

Introduce percpu rw semaphores

When many CPUs are locking a rw semaphore for read concurrently, cache
line bouncing occurs. When a CPU acquires the rw semaphore for read, the
CPU writes to the cache line holding the semaphore. Consequently, the
cache line is moved between CPUs and this slows down semaphore
acquisition.

This patch introduces new percpu rw semaphores. They are functionally
identical to existing rw semaphores, but locking the percpu rw semaphore
for read is faster and locking for write is slower.

The percpu rw semaphore is implemented as a percpu array of rw
semaphores, one semaphore for each CPU. When a thread needs to lock the
semaphore for read, only the semaphore on the current CPU is locked for
read. When a thread needs to lock the semaphore for write, the
semaphores for all CPUs are locked for write. This avoids cache line
bouncing.

Note that a thread holding the percpu rw semaphore for read may be
rescheduled to another CPU; this doesn't cause a bug, but cache line
bouncing occurs in that case.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 include/linux/percpu-rwsem.h |   77 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

Index: linux-3.5-fast/include/linux/percpu-rwsem.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-3.5-fast/include/linux/percpu-rwsem.h	2012-07-28 18:41:05.000000000 +0200
@@ -0,0 +1,77 @@
+#ifndef _LINUX_PERCPU_RWSEM_H
+#define _LINUX_PERCPU_RWSEM_H
+
+#include <linux/rwsem.h>
+#include <linux/percpu.h>
+
+#ifndef CONFIG_SMP
+
+#define percpu_rw_semaphore	rw_semaphore
+#define percpu_rwsem_ptr	int
+#define percpu_down_read(x)	(down_read(x), 0)
+#define percpu_up_read(x, y)	up_read(x)
+#define percpu_down_write	down_write
+#define percpu_up_write		up_write
+#define percpu_init_rwsem(x)	(({init_rwsem(x);}), 0)
+#define percpu_free_rwsem(x)	do { } while (0)
+
+#else
+
+struct percpu_rw_semaphore {
+	struct rw_semaphore __percpu *s;
+};
+
+typedef struct rw_semaphore *percpu_rwsem_ptr;
+
+static inline percpu_rwsem_ptr percpu_down_read(struct percpu_rw_semaphore *sem)
+{
+	struct rw_semaphore *s = __this_cpu_ptr(sem->s);
+	down_read(s);
+	return s;
+}
+
+static inline void percpu_up_read(struct percpu_rw_semaphore *sem, percpu_rwsem_ptr s)
+{
+	up_read(s);
+}
+
+static inline void percpu_down_write(struct percpu_rw_semaphore *sem)
+{
+	int cpu;
+	for_each_possible_cpu(cpu) {
+		struct rw_semaphore *s = per_cpu_ptr(sem->s, cpu);
+		down_write(s);
+	}
+}
+
+static inline void percpu_up_write(struct percpu_rw_semaphore *sem)
+{
+	int cpu;
+	for_each_possible_cpu(cpu) {
+		struct rw_semaphore *s = per_cpu_ptr(sem->s, cpu);
+		up_write(s);
+	}
+}
+
+static inline int percpu_init_rwsem(struct percpu_rw_semaphore *sem)
+{
+	int cpu;
+	sem->s = alloc_percpu(struct rw_semaphore);
+	if (unlikely(!sem->s))
+		return -ENOMEM;
+	for_each_possible_cpu(cpu) {
+		struct rw_semaphore *s = per_cpu_ptr(sem->s, cpu);
+		init_rwsem(s);
+	}
+	return 0;
+}
+
+static inline void percpu_free_rwsem(struct percpu_rw_semaphore *sem)
+{
+	free_percpu(sem->s);
+	sem->s = NULL;		/* catch use after free bugs */
+}
+
+#endif
+
+#endif
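
A note on reader-side usage: the caller keeps the pointer returned by
percpu_down_read() and hands it back to percpu_up_read(), so the release
always hits the semaphore that was actually taken, even if the thread
migrated to another CPU in between - this is exactly how the next patch
uses it:

	percpu_rwsem_ptr p;

	p = percpu_down_read(&bdev->bd_block_size_semaphore);
	/* ... block size cannot change while p is held ... */
	percpu_up_read(&bdev->bd_block_size_semaphore, p);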


* [PATCH 3/3] blockdev: turn a rw semaphore into a percpu rw semaphore
  2012-07-28 16:41             ` [PATCH 2/3] Introduce percpu rw semaphores Mikulas Patocka
@ 2012-07-28 16:42               ` Mikulas Patocka
  2012-07-28 20:44               ` [PATCH 2/3] Introduce percpu rw semaphores Eric Dumazet
  1 sibling, 0 replies; 45+ messages in thread
From: Mikulas Patocka @ 2012-07-28 16:42 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Jan Kara, Alexander Viro, Jens Axboe, Alasdair G. Kergon,
	linux-kernel, linux-fsdevel, dm-devel, lwoodman,
	Andrea Arcangeli, kosaki.motohiro

blockdev: turn a rw semaphore into a percpu rw semaphore

This avoids cache line bouncing when many processes lock the semaphore
for read.

Partially based on a patch by Jeff Moyer <jmoyer@redhat.com>.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 fs/block_dev.c     |   30 ++++++++++++++++++++----------
 include/linux/fs.h |    3 ++-
 2 files changed, 22 insertions(+), 11 deletions(-)

Index: linux-3.5-fast/fs/block_dev.c
===================================================================
--- linux-3.5-fast.orig/fs/block_dev.c	2012-07-28 18:32:10.000000000 +0200
+++ linux-3.5-fast/fs/block_dev.c	2012-07-28 18:32:12.000000000 +0200
@@ -127,7 +127,7 @@ int set_blocksize(struct block_device *b
 		return -EINVAL;
 
 	/* Prevent starting I/O or mapping the device */
-	down_write(&bdev->bd_block_size_semaphore);
+	percpu_down_write(&bdev->bd_block_size_semaphore);
 
 	/* Check that the block device is not memory mapped */
 	mapping = bdev->bd_inode->i_mapping;
@@ -135,7 +135,7 @@ int set_blocksize(struct block_device *b
 	if (!prio_tree_empty(&mapping->i_mmap) ||
 	    !list_empty(&mapping->i_mmap_nonlinear)) {
 		mutex_unlock(&mapping->i_mmap_mutex);
-		up_write(&bdev->bd_block_size_semaphore);
+		percpu_up_write(&bdev->bd_block_size_semaphore);
 		return -EBUSY;
 	}
 	mutex_unlock(&mapping->i_mmap_mutex);
@@ -148,7 +148,7 @@ int set_blocksize(struct block_device *b
 		kill_bdev(bdev);
 	}
 
-	up_write(&bdev->bd_block_size_semaphore);
+	percpu_up_write(&bdev->bd_block_size_semaphore);
 
 	return 0;
 }
@@ -460,6 +460,12 @@ static struct inode *bdev_alloc_inode(st
 	struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL);
 	if (!ei)
 		return NULL;
+
+	if (unlikely(percpu_init_rwsem(&ei->bdev.bd_block_size_semaphore))) {
+		kmem_cache_free(bdev_cachep, ei);
+		return NULL;
+	}
+
 	return &ei->vfs_inode;
 }
 
@@ -468,6 +474,8 @@ static void bdev_i_callback(struct rcu_h
 	struct inode *inode = container_of(head, struct inode, i_rcu);
 	struct bdev_inode *bdi = BDEV_I(inode);
 
+	percpu_free_rwsem(&bdi->bdev.bd_block_size_semaphore);
+
 	kmem_cache_free(bdev_cachep, bdi);
 }
 
@@ -491,7 +499,6 @@ static void init_once(void *foo)
 	inode_init_once(&ei->vfs_inode);
 	/* Initialize mutex for freeze. */
 	mutex_init(&bdev->bd_fsfreeze_mutex);
-	init_rwsem(&bdev->bd_block_size_semaphore);
 }
 
 static inline void __bd_forget(struct inode *inode)
@@ -1592,12 +1599,13 @@ ssize_t blkdev_aio_read(struct kiocb *io
 {
 	ssize_t ret;
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
+	percpu_rwsem_ptr p;
 
-	down_read(&bdev->bd_block_size_semaphore);
+	p = percpu_down_read(&bdev->bd_block_size_semaphore);
 
 	ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
 
-	up_read(&bdev->bd_block_size_semaphore);
+	percpu_up_read(&bdev->bd_block_size_semaphore, p);
 
 	return ret;
 }
@@ -1616,10 +1624,11 @@ ssize_t blkdev_aio_write(struct kiocb *i
 	struct file *file = iocb->ki_filp;
 	struct block_device *bdev = I_BDEV(file->f_mapping->host);
 	ssize_t ret;
+	percpu_rwsem_ptr p;
 
 	BUG_ON(iocb->ki_pos != pos);
 
-	down_read(&bdev->bd_block_size_semaphore);
+	p = percpu_down_read(&bdev->bd_block_size_semaphore);
 
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	if (ret > 0 || ret == -EIOCBQUEUED) {
@@ -1630,7 +1639,7 @@ ssize_t blkdev_aio_write(struct kiocb *i
 			ret = err;
 	}
 
-	up_read(&bdev->bd_block_size_semaphore);
+	percpu_up_read(&bdev->bd_block_size_semaphore, p);
 
 	return ret;
 }
@@ -1640,12 +1649,13 @@ int blkdev_mmap(struct file *file, struc
 {
 	int ret;
 	struct block_device *bdev = I_BDEV(file->f_mapping->host);
+	percpu_rwsem_ptr p;
 
-	down_read(&bdev->bd_block_size_semaphore);
+	p = percpu_down_read(&bdev->bd_block_size_semaphore);
 
 	ret = generic_file_mmap(file, vma);
 
-	up_read(&bdev->bd_block_size_semaphore);
+	percpu_up_read(&bdev->bd_block_size_semaphore, p);
 
 	return ret;
 }
Index: linux-3.5-fast/include/linux/fs.h
===================================================================
--- linux-3.5-fast.orig/include/linux/fs.h	2012-07-28 18:32:10.000000000 +0200
+++ linux-3.5-fast/include/linux/fs.h	2012-07-28 18:32:12.000000000 +0200
@@ -10,6 +10,7 @@
 #include <linux/ioctl.h>
 #include <linux/blk_types.h>
 #include <linux/types.h>
+#include <linux/percpu-rwsem.h>
 
 /*
  * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -714,7 +715,7 @@ struct block_device {
 	/* Mutex for freeze */
 	struct mutex		bd_fsfreeze_mutex;
 	/* A semaphore that prevents I/O while block size is being changed */
-	struct rw_semaphore	bd_block_size_semaphore;
+	struct percpu_rw_semaphore	bd_block_size_semaphore;
 };
 
 /*


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 2/3] Introduce percpu rw semaphores
  2012-07-28 16:41             ` [PATCH 2/3] Introduce percpu rw semaphores Mikulas Patocka
  2012-07-28 16:42               ` [PATCH 3/3] blockdev: turn a rw semaphore into a percpu rw semaphore Mikulas Patocka
@ 2012-07-28 20:44               ` Eric Dumazet
  2012-07-29  5:13                 ` [dm-devel] " Mikulas Patocka
  1 sibling, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2012-07-28 20:44 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jeff Moyer, Jan Kara, Alexander Viro, Jens Axboe,
	Alasdair G. Kergon, linux-kernel, linux-fsdevel, dm-devel,
	lwoodman, Andrea Arcangeli, kosaki.motohiro

On Sat, 2012-07-28 at 12:41 -0400, Mikulas Patocka wrote:
> Introduce percpu rw semaphores
> 
> When many CPUs are locking a rw semaphore for read concurrently, cache
> line bouncing occurs. When a CPU acquires rw semaphore for read, the
> CPU writes to the cache line holding the semaphore. Consequently, the
> cache line is being moved between CPUs and this slows down semaphore
> acquisition.
> 
> This patch introduces new percpu rw semaphores. They are functionally
> identical to existing rw semaphores, but locking the percpu rw semaphore
> for read is faster and locking for write is slower.
> 
> The percpu rw semaphore is implemented as a percpu array of rw
> semaphores, each semaphore for one CPU. When some thread needs to lock
> the semaphore for read, only semaphore on the current CPU is locked for
> read. When some thread needs to lock the semaphore for write, semaphores
> for all CPUs are locked for write. This avoids cache line bouncing.
> 
> Note that the thread that is locking percpu rw semaphore may be
> rescheduled, it doesn't cause bug, but cache line bouncing occurs in
> this case.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

I am curious to see how this performs with 4096 cpus?

Really you shouldn't use a rwlock in a path if this might hurt performance.

RCU is probably a better answer.

(bdev->bd_block_size should be read exactly once )
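Something like this (a sketch only, to illustrate "read exactly once";
not a patch):

	/* snapshot the block size once, at the start of the request */
	unsigned int size = ACCESS_ONCE(bdev->bd_block_size);

	/*
	 * Use only the local copy from here on; never re-read
	 * bdev->bd_block_size while the request is in flight.
	 */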




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dm-devel] [PATCH 2/3] Introduce percpu rw semaphores
  2012-07-28 20:44               ` [PATCH 2/3] Introduce percpu rw semaphores Eric Dumazet
@ 2012-07-29  5:13                 ` Mikulas Patocka
  2012-07-29 10:10                   ` Eric Dumazet
  2012-07-30 17:00                   ` [dm-devel] [PATCH 2/3] Introduce percpu rw semaphores Paul E. McKenney
  0 siblings, 2 replies; 45+ messages in thread
From: Mikulas Patocka @ 2012-07-29  5:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel, linux-kernel,
	Jeff Moyer, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon



On Sat, 28 Jul 2012, Eric Dumazet wrote:

> On Sat, 2012-07-28 at 12:41 -0400, Mikulas Patocka wrote:
> > Introduce percpu rw semaphores
> > 
> > When many CPUs are locking a rw semaphore for read concurrently, cache
> > line bouncing occurs. When a CPU acquires rw semaphore for read, the
> > CPU writes to the cache line holding the semaphore. Consequently, the
> > cache line is being moved between CPUs and this slows down semaphore
> > acquisition.
> > 
> > This patch introduces new percpu rw semaphores. They are functionally
> > identical to existing rw semaphores, but locking the percpu rw semaphore
> > for read is faster and locking for write is slower.
> > 
> > The percpu rw semaphore is implemented as a percpu array of rw
> > semaphores, each semaphore for one CPU. When some thread needs to lock
> > the semaphore for read, only semaphore on the current CPU is locked for
> > read. When some thread needs to lock the semaphore for write, semaphores
> > for all CPUs are locked for write. This avoids cache line bouncing.
> > 
> > Note that the thread that is locking percpu rw semaphore may be
> > rescheduled, it doesn't cause bug, but cache line bouncing occurs in
> > this case.
> > 
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> 
> I am curious to see how this performs with 4096 cpus?

Each cpu should have its own rw semaphore in its cache, so I don't see a 
problem there.

When you change block size, all 4096 rw semaphores are locked for write, 
but changing block size is not a performance sensitive operation.

> Really you shouldn't use a rwlock in a path if this might hurt performance.
> 
> RCU is probably a better answer.

RCU is meaningless here. RCU allows lockless dereference of a pointer. 
Here the problem is not pointer dereference; the problem is that the 
integer bd_block_size may change.

> (bdev->bd_block_size should be read exactly once )

Rewrite all direct and non-direct io code so that it reads block size just 
once ...

Mikulas

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dm-devel] [PATCH 2/3] Introduce percpu rw semaphores
  2012-07-29  5:13                 ` [dm-devel] " Mikulas Patocka
@ 2012-07-29 10:10                   ` Eric Dumazet
  2012-07-29 18:36                     ` Eric Dumazet
  2012-07-30 17:00                   ` [dm-devel] [PATCH 2/3] Introduce percpu rw semaphores Paul E. McKenney
  1 sibling, 1 reply; 45+ messages in thread
From: Eric Dumazet @ 2012-07-29 10:10 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel, linux-kernel,
	Jeff Moyer, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

On Sun, 2012-07-29 at 01:13 -0400, Mikulas Patocka wrote:

> Each cpu should have its own rw semaphore in its cache, so I don't see a 
> problem there.
> 
> When you change block size, all 4096 rw semaphores are locked for write, 
> but changing block size is not a performance sensitive operation.
> 
> > Really you shouldn't use a rwlock in a path if this might hurt performance.
> > 
> > RCU is probably a better answer.
> 
> RCU is meaningless here. RCU allows lockless dereference of a pointer. 
> Here the problem is not pointer dereference, the problem is that integer 
> bd_block_size may change.

So add a pointer if you need to. That's the point.
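
For instance (purely illustrative; the struct and the bd_blksize_p field
are made up):

	struct bdev_blksize {
		unsigned int blkbits;
	};

	/* reader: one rcu_dereference(), one stable snapshot */
	unsigned int blkbits;

	rcu_read_lock();
	blkbits = rcu_dereference(bdev->bd_blksize_p)->blkbits;
	rcu_read_unlock();

	/*
	 * writer: allocate and fill a new struct, rcu_assign_pointer()
	 * it into bdev->bd_blksize_p, synchronize_rcu(), free the old one
	 */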

> 
> > (bdev->bd_block_size should be read exactly once )
> 
> Rewrite all direct and non-direct io code so that it reads block size just 
> once ...


You introduced percpu rw semaphores; that's only an incentive for people
to use that infrastructure elsewhere.

And it's a big hammer:

sizeof(struct rw_semaphore)=0x70 

You can probably design something needing no more than 4 bytes per cpu,
and this thing could use non-locked operations as a bonus.

like the following ...

struct percpu_rw_semaphore {
	/* percpu_sem_down_read() use the following in fast path */
	unsigned int __percpu *active_counters;

	unsigned int __percpu *counters;
	struct rw_semaphore	sem; /* used in slow path and by writers */
};

static inline int percpu_sem_init(struct percpu_rw_semaphore *p)
{
	p->counters = alloc_percpu(unsigned int);
	if (!p->counters)
		return -ENOMEM;
	init_rwsem(&p->sem);
	p->active_counters = p->counters;
	return 0;
}


static inline bool percpu_sem_down_read(struct percpu_rw_semaphore *p)
{
	unsigned int __percpu *counters = ACCESS_ONCE(p->active_counters);

	if (counters) {
		this_cpu_inc(*counters);
		return true;
	}
	down_read(&p->sem);
	return false;
}

static inline void percpu_sem_up_read(struct percpu_rw_semaphore *p, bool fastpath)
{
	if (fastpath)
		this_cpu_dec(*p->counters);
	else
		up_read(&p->sem);
}

static inline unsigned int percpu_count(unsigned int *counters)
{
	unsigned int total = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		total += *per_cpu_ptr(counters, cpu);

	return total;
}

static inline void percpu_sem_down_write(struct percpu_rw_semaphore *p)
{
	down_write(&p->sem);
	p->active_counters = NULL;

	while (percpu_count(p->counters))
		schedule();
}

static inline void percpu_sem_up_write(struct percpu_rw_semaphore *p)
{
	p->active_counters = p->counters;
	up_write(&p->sem);
}





^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dm-devel] [PATCH 2/3] Introduce percpu rw semaphores
  2012-07-29 10:10                   ` Eric Dumazet
@ 2012-07-29 18:36                     ` Eric Dumazet
  2012-08-01 20:07                       ` Mikulas Patocka
  2012-08-01 20:09                       ` [PATCH 4/3] " Mikulas Patocka
  0 siblings, 2 replies; 45+ messages in thread
From: Eric Dumazet @ 2012-07-29 18:36 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel, linux-kernel,
	Jeff Moyer, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

On Sun, 2012-07-29 at 12:10 +0200, Eric Dumazet wrote:

> You can probably design something needing no more than 4 bytes per cpu,
> and this thing could use non-locked operations as a bonus.
> 
> like the following ...

Coming back from my bike ride, here is a more polished version with
proper synchronization/barriers.

struct percpu_rw_semaphore {
	/* percpu_sem_down_read() use the following in fast path */
	unsigned int __percpu *active_counters;

	unsigned int __percpu *counters;
	struct rw_semaphore	sem; /* used in slow path and by writers */
};

static inline int percpu_sem_init(struct percpu_rw_semaphore *p)
{
	p->counters = alloc_percpu(unsigned int);
	if (!p->counters)
		return -ENOMEM;
	init_rwsem(&p->sem);
	rcu_assign_pointer(p->active_counters, p->counters);
	return 0;
}


static inline bool percpu_sem_down_read(struct percpu_rw_semaphore *p)
{
	unsigned int __percpu *counters;

	rcu_read_lock();
	counters = rcu_dereference(p->active_counters);
	if (counters) {
		this_cpu_inc(*counters);
		smp_wmb(); /* paired with smp_rmb() in percpu_count() */
		rcu_read_unlock();
		return true;
	}
	rcu_read_unlock();
	down_read(&p->sem);
	return false;
}

static inline void percpu_sem_up_read(struct percpu_rw_semaphore *p, bool fastpath)
{
	if (fastpath)
		this_cpu_dec(*p->counters);
	else
		up_read(&p->sem);
}

static inline unsigned int percpu_count(unsigned int __percpu *counters)
{
	unsigned int total = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		total += *per_cpu_ptr(counters, cpu);

	return total;
}

static inline void percpu_sem_down_write(struct percpu_rw_semaphore *p)
{
	down_write(&p->sem);
	p->active_counters = NULL;
	synchronize_rcu();
	smp_rmb(); /* paired with smp_wmb() in percpu_sem_down_read() */

	while (percpu_count(p->counters))
		schedule();
}

static inline void percpu_sem_up_write(struct percpu_rw_semaphore *p)
{
	rcu_assign_pointer(p->active_counters, p->counters);
	up_write(&p->sem);
}
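
Usage would be along these lines (a sketch; "my_sem" is a made-up
instance, and the fastpath flag returned by the down must be passed back
to the up):

	bool fastpath;

	fastpath = percpu_sem_down_read(&my_sem);
	/* read-side critical section, e.g. submitting the I/O */
	percpu_sem_up_read(&my_sem, fastpath);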



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dm-devel] [PATCH 2/3] Introduce percpu rw semaphores
  2012-07-29  5:13                 ` [dm-devel] " Mikulas Patocka
  2012-07-29 10:10                   ` Eric Dumazet
@ 2012-07-30 17:00                   ` Paul E. McKenney
  2012-07-31  0:00                     ` Mikulas Patocka
  1 sibling, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2012-07-30 17:00 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Eric Dumazet, Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Jeff Moyer, Alexander Viro, kosaki.motohiro,
	linux-fsdevel, lwoodman, Alasdair G. Kergon

On Sun, Jul 29, 2012 at 01:13:34AM -0400, Mikulas Patocka wrote:
> On Sat, 28 Jul 2012, Eric Dumazet wrote:
> > On Sat, 2012-07-28 at 12:41 -0400, Mikulas Patocka wrote:

[ . . . ]

> > (bdev->bd_block_size should be read exactly once )
> 
> Rewrite all direct and non-direct io code so that it reads block size just 
> once ...

For whatever it is worth, the 3.5 Linux kernel only has about ten mentions
of bd_block_size, at least according to cscope.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dm-devel] [PATCH 2/3] Introduce percpu rw semaphores
  2012-07-30 17:00                   ` [dm-devel] [PATCH 2/3] Introduce percpu rw semaphores Paul E. McKenney
@ 2012-07-31  0:00                     ` Mikulas Patocka
  2012-08-01 17:15                       ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Mikulas Patocka @ 2012-07-31  0:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Eric Dumazet, Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Jeff Moyer, Alexander Viro, kosaki.motohiro,
	linux-fsdevel, lwoodman, Alasdair G. Kergon



On Mon, 30 Jul 2012, Paul E. McKenney wrote:

> On Sun, Jul 29, 2012 at 01:13:34AM -0400, Mikulas Patocka wrote:
> > On Sat, 28 Jul 2012, Eric Dumazet wrote:
> > > On Sat, 2012-07-28 at 12:41 -0400, Mikulas Patocka wrote:
> 
> [ . . . ]
> 
> > > (bdev->bd_block_size should be read exactly once )
> > 
> > Rewrite all direct and non-direct io code so that it reads block size just 
> > once ...
> 
> For whatever it is worth, the 3.5 Linux kernel only has about ten mentions
> of bd_block_size, at least according to cscope.
> 
> 							Thanx, Paul

plus 213 uses of i_blkbits (which is derived directly from bd_block_size). 
45 of them are in fs/*.c and mm/*.c.

Mikulas

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dm-devel] [PATCH 2/3] Introduce percpu rw semaphores
  2012-07-31  0:00                     ` Mikulas Patocka
@ 2012-08-01 17:15                       ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2012-08-01 17:15 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Eric Dumazet, Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Jeff Moyer, Alexander Viro, kosaki.motohiro,
	linux-fsdevel, lwoodman, Alasdair G. Kergon

On Mon, Jul 30, 2012 at 08:00:19PM -0400, Mikulas Patocka wrote:
> 
> 
> On Mon, 30 Jul 2012, Paul E. McKenney wrote:
> 
> > On Sun, Jul 29, 2012 at 01:13:34AM -0400, Mikulas Patocka wrote:
> > > On Sat, 28 Jul 2012, Eric Dumazet wrote:
> > > > On Sat, 2012-07-28 at 12:41 -0400, Mikulas Patocka wrote:
> > 
> > [ . . . ]
> > 
> > > > (bdev->bd_block_size should be read exactly once )
> > > 
> > > Rewrite all direct and non-direct io code so that it reads block size just 
> > > once ...
> > 
> > For whatever it is worth, the 3.5 Linux kernel only has about ten mentions
> > of bd_block_size, at least according to cscope.
> 
> plus 213 uses of i_blkbits (which is derived directly from bd_block_size). 
> 45 of them are in fs/*.c and mm/*.c.

At least it is only hundreds rather than thousands!  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [dm-devel] [PATCH 2/3] Introduce percpu rw semaphores
  2012-07-29 18:36                     ` Eric Dumazet
@ 2012-08-01 20:07                       ` Mikulas Patocka
  2012-08-01 20:09                       ` [PATCH 4/3] " Mikulas Patocka
  1 sibling, 0 replies; 45+ messages in thread
From: Mikulas Patocka @ 2012-08-01 20:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel, linux-kernel,
	Jeff Moyer, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon



On Sun, 29 Jul 2012, Eric Dumazet wrote:

> On Sun, 2012-07-29 at 12:10 +0200, Eric Dumazet wrote:
> 
> > You can probably design something needing no more than 4 bytes per cpu,
> > and this thing could use non-locked operations as a bonus.
> > 
> > like the following ...
> 
> Coming back from my bike ride, here is a more polished version with
> proper synchronization/barriers.
> 
> struct percpu_rw_semaphore {
> 	/* percpu_sem_down_read() use the following in fast path */
> 	unsigned int __percpu *active_counters;
> 
> 	unsigned int __percpu *counters;
> 	struct rw_semaphore	sem; /* used in slow path and by writers */
> };
> 
> static inline int percpu_sem_init(struct percpu_rw_semaphore *p)
> {
> 	p->counters = alloc_percpu(unsigned int);
> 	if (!p->counters)
> 		return -ENOMEM;
> 	init_rwsem(&p->sem);
> 	rcu_assign_pointer(p->active_counters, p->counters);
> 	return 0;
> }
> 
> 
> static inline bool percpu_sem_down_read(struct percpu_rw_semaphore *p)
> {
> 	unsigned int __percpu *counters;
> 
> 	rcu_read_lock();
> 	counters = rcu_dereference(p->active_counters);
> 	if (counters) {
> 		this_cpu_inc(*counters);
> 		smp_wmb(); /* paired with smp_rmb() in percpu_count() */

Why is this barrier needed? RCU works as a barrier, doesn't it?
RCU is unlocked when the cpu passes a quiescent state, and I suppose that 
entering the quiescent state works as a barrier. Or doesn't it?

> 		rcu_read_unlock();
> 		return true;
> 	}
> 	rcu_read_unlock();
> 	down_read(&p->sem);
> 	return false;
> }
> 
> static inline void percpu_sem_up_read(struct percpu_rw_semaphore *p, bool fastpath)
> {
> 	if (fastpath)
> 		this_cpu_dec(*p->counters);
> 	else
> 		up_read(&p->sem);
> }
> 
> static inline unsigned int percpu_count(unsigned int __percpu *counters)
> {
> 	unsigned int total = 0;
> 	int cpu;
> 
> 	for_each_possible_cpu(cpu)
> 		total += *per_cpu_ptr(counters, cpu);
> 
> 	return total;
> }
> 
> static inline void percpu_sem_down_write(struct percpu_rw_semaphore *p)
> {
> 	down_write(&p->sem);
> 	p->active_counters = NULL;
> 	synchronize_rcu();
> 	smp_rmb(); /* paired with smp_wmb() in percpu_sem_down_read() */

Why a barrier here? Doesn't synchronize_rcu() work as a barrier?

Mikulas

> 	while (percpu_count(p->counters))
> 		schedule();
> }
> 
> static inline void percpu_sem_up_write(struct percpu_rw_semaphore *p)
> {
> 	rcu_assign_pointer(p->active_counters, p->counters);
> 	up_write(&p->sem);
> }
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 4/3] Introduce percpu rw semaphores
  2012-07-29 18:36                     ` Eric Dumazet
  2012-08-01 20:07                       ` Mikulas Patocka
@ 2012-08-01 20:09                       ` Mikulas Patocka
  2012-08-31 18:40                         ` [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time Mikulas Patocka
  1 sibling, 1 reply; 45+ messages in thread
From: Mikulas Patocka @ 2012-08-01 20:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel, linux-kernel,
	Jeff Moyer, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon



On Sun, 29 Jul 2012, Eric Dumazet wrote:

> On Sun, 2012-07-29 at 12:10 +0200, Eric Dumazet wrote:
> 
> > You can probably design something needing no more than 4 bytes per cpu,
> > and this thing could use non-locked operations as a bonus.
> > 
> > like the following ...
> 
> Coming back from my bike ride, here is a more polished version with
> proper synchronization/barriers.

Hi Eric

I reworked your patch (it should be applied after my previous patch 3/3). 
I replaced the rw-semaphore with a mutex. Instead of two pointers, I 
changed it to one pointer and one bool variable.

I removed the barriers next to the RCU calls (because RCU works as a 
barrier) and added a barrier when decrementing the percpu variable and 
when waiting for percpu_count to be zero.

I tested the performance of all implementations:
30.2s with no lock at all
32.2s with a global rw-lock
30.6s with a per-cpu rw-lock (there is no difference between my original 
implementation and Eric Dumazet's)

Mikulas

---

New percpu lock implementation

An alternative percpu lock implementation. The original idea is by
Eric Dumazet <eric.dumazet@gmail.com>.

The lock consists of an array of percpu unsigned integers, a boolean
variable and a mutex.

When we take the lock for read, we enter an RCU read section and check
the "locked" variable. If it is false, we increase a percpu counter on
the current cpu and exit the RCU section. If "locked" is true, we exit
the RCU section, take the mutex and drop it (this waits until a writer
has finished) and retry.

Unlocking for read just decreases the percpu variable. Note that we can
unlock on a different cpu than where we locked; in this case the
counter underflows. The sum of all percpu counters represents the number
of processes that hold the lock for read.
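
For example: a reader locks on cpu 0 (counter 0 becomes 1) and is
migrated to cpu 1 before unlocking; the unlock decrements counter 1,
which underflows to 0xffffffff. In unsigned arithmetic the sum
1 + 0xffffffff wraps to 0, correctly reporting that no readers remain.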

When we need to lock for write, we take the mutex, set the "locked"
variable to true and synchronize RCU. Since RCU has been synchronized, no
processes can create new read locks. We wait until the sum of percpu
counters is zero - when it is, there are no readers in the critical
section.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 fs/block_dev.c               |   15 ++----
 include/linux/percpu-rwsem.h |   93 +++++++++++++++++++++----------------------
 2 files changed, 53 insertions(+), 55 deletions(-)

Index: linux-3.5-fast/fs/block_dev.c
===================================================================
--- linux-3.5-fast.orig/fs/block_dev.c	2012-07-31 21:59:24.000000000 +0200
+++ linux-3.5-fast/fs/block_dev.c	2012-08-01 19:39:26.000000000 +0200
@@ -1602,13 +1602,12 @@ ssize_t blkdev_aio_read(struct kiocb *io
 {
 	ssize_t ret;
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
-	percpu_rwsem_ptr p;
 
-	p = percpu_down_read(&bdev->bd_block_size_semaphore);
+	percpu_down_read(&bdev->bd_block_size_semaphore);
 
 	ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
 
-	percpu_up_read(&bdev->bd_block_size_semaphore, p);
+	percpu_up_read(&bdev->bd_block_size_semaphore);
 
 	return ret;
 }
@@ -1627,11 +1626,10 @@ ssize_t blkdev_aio_write(struct kiocb *i
 	struct file *file = iocb->ki_filp;
 	struct block_device *bdev = I_BDEV(file->f_mapping->host);
 	ssize_t ret;
-	percpu_rwsem_ptr p;
 
 	BUG_ON(iocb->ki_pos != pos);
 
-	p = percpu_down_read(&bdev->bd_block_size_semaphore);
+	percpu_down_read(&bdev->bd_block_size_semaphore);
 
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	if (ret > 0 || ret == -EIOCBQUEUED) {
@@ -1642,7 +1640,7 @@ ssize_t blkdev_aio_write(struct kiocb *i
 			ret = err;
 	}
 
-	percpu_up_read(&bdev->bd_block_size_semaphore, p);
+	percpu_up_read(&bdev->bd_block_size_semaphore);
 
 	return ret;
 }
@@ -1652,13 +1650,12 @@ int blkdev_mmap(struct file *file, struc
 {
 	int ret;
 	struct block_device *bdev = I_BDEV(file->f_mapping->host);
-	percpu_rwsem_ptr p;
 
-	p = percpu_down_read(&bdev->bd_block_size_semaphore);
+	percpu_down_read(&bdev->bd_block_size_semaphore);
 
 	ret = generic_file_mmap(file, vma);
 
-	percpu_up_read(&bdev->bd_block_size_semaphore, p);
+	percpu_up_read(&bdev->bd_block_size_semaphore);
 
 	return ret;
 }
Index: linux-3.5-fast/include/linux/percpu-rwsem.h
===================================================================
--- linux-3.5-fast.orig/include/linux/percpu-rwsem.h	2012-07-31 21:47:25.000000000 +0200
+++ linux-3.5-fast/include/linux/percpu-rwsem.h	2012-08-01 19:32:53.000000000 +0200
@@ -3,75 +3,76 @@
 
 #include <linux/rwsem.h>
 #include <linux/percpu.h>
-
-#ifndef CONFIG_SMP
-
-#define percpu_rw_semaphore	rw_semaphore
-#define percpu_rwsem_ptr	int
-#define percpu_down_read(x)	(down_read(x), 0)
-#define percpu_up_read(x, y)	up_read(x)
-#define percpu_down_write	down_write
-#define percpu_up_write		up_write
-#define percpu_init_rwsem(x)	(({init_rwsem(x);}), 0)
-#define percpu_free_rwsem(x)	do { } while (0)
-
-#else
+#include <linux/rcupdate.h>
+#include <linux/delay.h>
 
 struct percpu_rw_semaphore {
-	struct rw_semaphore __percpu *s;
+	unsigned __percpu *counters;
+	bool locked;
+	struct mutex mtx;
 };
 
-typedef struct rw_semaphore *percpu_rwsem_ptr;
-
-static inline percpu_rwsem_ptr percpu_down_read(struct percpu_rw_semaphore *sem)
+static inline void percpu_down_read(struct percpu_rw_semaphore *p)
 {
-	struct rw_semaphore *s = __this_cpu_ptr(sem->s);
-	down_read(s);
-	return s;
+retry:
+	rcu_read_lock();
+	if (unlikely(p->locked)) {
+		rcu_read_unlock();
+		mutex_lock(&p->mtx);
+		mutex_unlock(&p->mtx);
+		goto retry;
+	}
+	this_cpu_inc(*p->counters);
+	rcu_read_unlock();
 }
 
-static inline void percpu_up_read(struct percpu_rw_semaphore *sem, percpu_rwsem_ptr s)
+static inline void percpu_up_read(struct percpu_rw_semaphore *p)
 {
-	up_read(s);
+	smp_wmb();
+	this_cpu_dec(*p->counters);
 }
 
-static inline void percpu_down_write(struct percpu_rw_semaphore *sem)
+static inline unsigned int percpu_count(unsigned __percpu *counters)
 {
+	unsigned total = 0;
 	int cpu;
-	for_each_possible_cpu(cpu) {
-		struct rw_semaphore *s = per_cpu_ptr(sem->s, cpu);
-		down_write(s);
-	}
+
+	for_each_possible_cpu(cpu)
+		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
+
+	return total;
 }
 
-static inline void percpu_up_write(struct percpu_rw_semaphore *sem)
+static inline void percpu_down_write(struct percpu_rw_semaphore *p)
 {
-	int cpu;
-	for_each_possible_cpu(cpu) {
-		struct rw_semaphore *s = per_cpu_ptr(sem->s, cpu);
-		up_write(s);
-	}
+	mutex_lock(&p->mtx);
+	p->locked = true;
+	synchronize_rcu();
+	while (percpu_count(p->counters))
+		msleep(1);
+	smp_rmb(); /* paired with smp_wmb() in percpu_up_read() */
 }
 
-static inline int percpu_init_rwsem(struct percpu_rw_semaphore *sem)
+static inline void percpu_up_write(struct percpu_rw_semaphore *p)
 {
-	int cpu;
-	sem->s = alloc_percpu(struct rw_semaphore);
-	if (unlikely(!sem->s))
+	p->locked = false;
+	mutex_unlock(&p->mtx);
+}
+
+static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
+{
+	p->counters = alloc_percpu(unsigned);
+	if (unlikely(!p->counters))
 		return -ENOMEM;
-	for_each_possible_cpu(cpu) {
-		struct rw_semaphore *s = per_cpu_ptr(sem->s, cpu);
-		init_rwsem(s);
-	}
+	p->locked = false;
+	mutex_init(&p->mtx);
 	return 0;
 }
 
-static inline void percpu_free_rwsem(struct percpu_rw_semaphore *sem)
+static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
 {
-	free_percpu(sem->s);
-	sem->s = NULL;		/* catch use after free bugs */
+	free_percpu(p->counters);
+	p->counters = NULL; /* catch use after free bugs */
 }
 
 #endif
-
-#endif

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
  2012-08-01 20:09                       ` [PATCH 4/3] " Mikulas Patocka
@ 2012-08-31 18:40                         ` Mikulas Patocka
  2012-08-31 18:41                           ` [PATCH 1/4] Add a lock that will be needed by the next patch Mikulas Patocka
  2012-08-31 19:27                           ` [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time Mikulas Patocka
  0 siblings, 2 replies; 45+ messages in thread
From: Mikulas Patocka @ 2012-08-31 18:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel, linux-kernel,
	Jeff Moyer, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

Hi

This is a series of patches to prevent a crash when someone is reading a 
block device and the block size is changed simultaneously. (The crash is 
already happening in production environments.)

The first patch adds a rw-lock to struct block_device, but doesn't use the 
lock anywhere. The reason why I submit this as a separate patch is that on 
my computer adding an unused field to this structure affects performance 
much more than any locking changes.

The second patch uses the rw-lock. The lock is locked for read when doing 
I/O on the block device and it is locked for write when changing block 
size.

The third patch converts the rw-lock to a percpu rw-lock for better 
performance, to avoid cache line bouncing.

The fourth patch is an alternate percpu rw-lock implementation using RCU 
by Eric Dumazet. It avoids any atomic instruction in the hot path.

Mikulas

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 1/4] Add a lock that will be needed by the next patch
  2012-08-31 18:40                         ` [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time Mikulas Patocka
@ 2012-08-31 18:41                           ` Mikulas Patocka
  2012-08-31 18:42                             ` [PATCH 2/4] blockdev: fix a crash when block size is changed and I/O is issued simultaneously Mikulas Patocka
  2012-08-31 19:27                           ` [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time Mikulas Patocka
  1 sibling, 1 reply; 45+ messages in thread
From: Mikulas Patocka @ 2012-08-31 18:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel, linux-kernel,
	Jeff Moyer, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

Add a lock that will be needed by the next patch.

I am submitting this as a separate patch because the act of adding an
extra field to "struct block_device" affects performance significantly.

This way, the performance change from adding the lock field can be
differentiated from the performance change caused by the locking itself.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/char/raw.c |    2 -
 fs/block_dev.c     |   60 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/fs.h |    4 +++
 3 files changed, 63 insertions(+), 3 deletions(-)

Index: linux-3.5-rc6-devel/include/linux/fs.h
===================================================================
--- linux-3.5-rc6-devel.orig/include/linux/fs.h	2012-07-16 20:20:12.000000000 +0200
+++ linux-3.5-rc6-devel/include/linux/fs.h	2012-07-16 01:29:21.000000000 +0200
@@ -713,6 +713,8 @@ struct block_device {
 	int			bd_fsfreeze_count;
 	/* Mutex for freeze */
 	struct mutex		bd_fsfreeze_mutex;
+	/* A semaphore that prevents I/O while block size is being changed */
+	struct rw_semaphore	bd_block_size_semaphore;
 };
 
 /*


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 2/4] blockdev: fix a crash when block size is changed and I/O is issued simultaneously
  2012-08-31 18:41                           ` [PATCH 1/4] Add a lock that will be needed by the next patch Mikulas Patocka
@ 2012-08-31 18:42                             ` Mikulas Patocka
  2012-08-31 18:43                               ` [PATCH 3/4] blockdev: turn a rw semaphore into a percpu rw semaphore Mikulas Patocka
  0 siblings, 1 reply; 45+ messages in thread
From: Mikulas Patocka @ 2012-08-31 18:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel, linux-kernel,
	Jeff Moyer, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

blockdev: fix a crash when block size is changed and I/O is issued simultaneously

The kernel may crash when the block size is changed and I/O is issued
simultaneously.

Because some subsystems (udev or lvm) may read any block device at any
time, the bug actually puts any code that changes a block device's block
size in jeopardy.

The crash can be reproduced if you place "msleep(1000)" in
blkdev_get_blocks just before "bh->b_size = max_blocks <<
inode->i_blkbits;".
Then, run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct".
While it is waiting in msleep, run "blockdev --setbsz 2048 /dev/ram0".
You get a BUG.

The direct and non-direct I/O code is written with the assumption that
the block size does not change. It doesn't seem practical to fix these
crashes one by one: there may be many possible crashes when the block
size changes at a particular point, and it is impossible to find them
all and verify the code.

This patch introduces a new rw-lock, bd_block_size_semaphore. The lock
is taken for read during I/O. It is taken for write when changing the
block size. Consequently, the block size can't be changed while I/O is
being submitted.

For asynchronous I/O, the patch only prevents a block size change while
the I/O is being submitted. The block size can change while the I/O is
in progress or while it is being finished. This is acceptable because
there are no accesses to the block size when asynchronous I/O is being
finished.

The patch also prevents the block size from changing while the device is
mapped with mmap.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/char/raw.c |    2 -
 fs/block_dev.c     |   60 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/fs.h |    2 +
 3 files changed, 61 insertions(+), 3 deletions(-)

Index: linux-3.5-fast/fs/block_dev.c
===================================================================
--- linux-3.5-fast.orig/fs/block_dev.c	2012-08-02 23:05:30.000000000 +0200
+++ linux-3.5-fast/fs/block_dev.c	2012-08-02 23:05:39.000000000 +0200
@@ -116,6 +116,8 @@ EXPORT_SYMBOL(invalidate_bdev);
 
 int set_blocksize(struct block_device *bdev, int size)
 {
+	struct address_space *mapping;
+
 	/* Size must be a power of two, and between 512 and PAGE_SIZE */
 	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
 		return -EINVAL;
@@ -124,6 +126,20 @@ int set_blocksize(struct block_device *b
 	if (size < bdev_logical_block_size(bdev))
 		return -EINVAL;
 
+	/* Prevent starting I/O or mapping the device */
+	down_write(&bdev->bd_block_size_semaphore);
+
+	/* Check that the block device is not memory mapped */
+	mapping = bdev->bd_inode->i_mapping;
+	mutex_lock(&mapping->i_mmap_mutex);
+	if (!prio_tree_empty(&mapping->i_mmap) ||
+	    !list_empty(&mapping->i_mmap_nonlinear)) {
+		mutex_unlock(&mapping->i_mmap_mutex);
+		up_write(&bdev->bd_block_size_semaphore);
+		return -EBUSY;
+	}
+	mutex_unlock(&mapping->i_mmap_mutex);
+
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
 		sync_blockdev(bdev);
@@ -131,6 +147,9 @@ int set_blocksize(struct block_device *b
 		bdev->bd_inode->i_blkbits = blksize_bits(size);
 		kill_bdev(bdev);
 	}
+
+	up_write(&bdev->bd_block_size_semaphore);
+
 	return 0;
 }
 
@@ -472,6 +491,7 @@ static void init_once(void *foo)
 	inode_init_once(&ei->vfs_inode);
 	/* Initialize mutex for freeze. */
 	mutex_init(&bdev->bd_fsfreeze_mutex);
+	init_rwsem(&bdev->bd_block_size_semaphore);
 }
 
 static inline void __bd_forget(struct inode *inode)
@@ -1567,6 +1587,22 @@ static long block_ioctl(struct file *fil
 	return blkdev_ioctl(bdev, mode, cmd, arg);
 }
 
+ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			unsigned long nr_segs, loff_t pos)
+{
+	ssize_t ret;
+	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
+
+	down_read(&bdev->bd_block_size_semaphore);
+
+	ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
+
+	up_read(&bdev->bd_block_size_semaphore);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_aio_read);
+
 /*
  * Write data to the block device.  Only intended for the block device itself
  * and the raw driver which basically is a fake block device.
@@ -1578,10 +1614,13 @@ ssize_t blkdev_aio_write(struct kiocb *i
 			 unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
+	struct block_device *bdev = I_BDEV(file->f_mapping->host);
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
 
+	down_read(&bdev->bd_block_size_semaphore);
+
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	if (ret > 0 || ret == -EIOCBQUEUED) {
 		ssize_t err;
@@ -1590,10 +1629,27 @@ ssize_t blkdev_aio_write(struct kiocb *i
 		if (err < 0 && ret > 0)
 			ret = err;
 	}
+
+	up_read(&bdev->bd_block_size_semaphore);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_aio_write);
 
+int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	int ret;
+	struct block_device *bdev = I_BDEV(file->f_mapping->host);
+
+	down_read(&bdev->bd_block_size_semaphore);
+
+	ret = generic_file_mmap(file, vma);
+
+	up_read(&bdev->bd_block_size_semaphore);
+
+	return ret;
+}
+
 /*
  * Try to release a page associated with block device when the system
  * is under memory pressure.
@@ -1624,9 +1680,9 @@ const struct file_operations def_blk_fop
 	.llseek		= block_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
-  	.aio_read	= generic_file_aio_read,
+  	.aio_read	= blkdev_aio_read,
 	.aio_write	= blkdev_aio_write,
-	.mmap		= generic_file_mmap,
+	.mmap		= blkdev_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
 #ifdef CONFIG_COMPAT
Index: linux-3.5-fast/drivers/char/raw.c
===================================================================
--- linux-3.5-fast.orig/drivers/char/raw.c	2012-08-02 23:05:11.000000000 +0200
+++ linux-3.5-fast/drivers/char/raw.c	2012-08-02 23:05:39.000000000 +0200
@@ -285,7 +285,7 @@ static long raw_ctl_compat_ioctl(struct
 
 static const struct file_operations raw_fops = {
 	.read		= do_sync_read,
-	.aio_read	= generic_file_aio_read,
+	.aio_read	= blkdev_aio_read,
 	.write		= do_sync_write,
 	.aio_write	= blkdev_aio_write,
 	.fsync		= blkdev_fsync,
Index: linux-3.5-fast/include/linux/fs.h
===================================================================
--- linux-3.5-fast.orig/include/linux/fs.h	2012-08-02 23:05:38.000000000 +0200
+++ linux-3.5-fast/include/linux/fs.h	2012-08-02 23:05:39.000000000 +0200
@@ -2416,6 +2416,8 @@ extern int generic_segment_checks(const
 		unsigned long *nr_segs, size_t *count, int access_flags);
 
 /* fs/block_dev.c */
+extern ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			       unsigned long nr_segs, loff_t pos);
 extern ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
 				unsigned long nr_segs, loff_t pos);
 extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end,


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 3/4] blockdev: turn a rw semaphore into a percpu rw semaphore
  2012-08-31 18:42                             ` [PATCH 2/4] blockdev: fix a crash when block size is changed and I/O is issued simultaneously Mikulas Patocka
@ 2012-08-31 18:43                               ` Mikulas Patocka
  2012-08-31 18:43                                 ` [PATCH 4/4] New percpu lock implementation Mikulas Patocka
  0 siblings, 1 reply; 45+ messages in thread
From: Mikulas Patocka @ 2012-08-31 18:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel, linux-kernel,
	Jeff Moyer, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

blockdev: turn a rw semaphore into a percpu rw semaphore

This avoids cache line bouncing when many processes lock the semaphore
for read.

Partially based on a patch by Jeff Moyer <jmoyer@redhat.com>.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 fs/block_dev.c     |   30 ++++++++++++++++++++----------
 include/linux/fs.h |    3 ++-
 2 files changed, 22 insertions(+), 11 deletions(-)

Index: linux-3.5-fast/fs/block_dev.c
===================================================================
--- linux-3.5-fast.orig/fs/block_dev.c	2012-07-28 18:32:10.000000000 +0200
+++ linux-3.5-fast/fs/block_dev.c	2012-07-28 18:32:12.000000000 +0200
@@ -127,7 +127,7 @@ int set_blocksize(struct block_device *b
 		return -EINVAL;
 
 	/* Prevent starting I/O or mapping the device */
-	down_write(&bdev->bd_block_size_semaphore);
+	percpu_down_write(&bdev->bd_block_size_semaphore);
 
 	/* Check that the block device is not memory mapped */
 	mapping = bdev->bd_inode->i_mapping;
@@ -135,7 +135,7 @@ int set_blocksize(struct block_device *b
 	if (!prio_tree_empty(&mapping->i_mmap) ||
 	    !list_empty(&mapping->i_mmap_nonlinear)) {
 		mutex_unlock(&mapping->i_mmap_mutex);
-		up_write(&bdev->bd_block_size_semaphore);
+		percpu_up_write(&bdev->bd_block_size_semaphore);
 		return -EBUSY;
 	}
 	mutex_unlock(&mapping->i_mmap_mutex);
@@ -148,7 +148,7 @@ int set_blocksize(struct block_device *b
 		kill_bdev(bdev);
 	}
 
-	up_write(&bdev->bd_block_size_semaphore);
+	percpu_up_write(&bdev->bd_block_size_semaphore);
 
 	return 0;
 }
@@ -460,6 +460,12 @@ static struct inode *bdev_alloc_inode(st
 	struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL);
 	if (!ei)
 		return NULL;
+
+	if (unlikely(percpu_init_rwsem(&ei->bdev.bd_block_size_semaphore))) {
+		kmem_cache_free(bdev_cachep, ei);
+		return NULL;
+	}
+
 	return &ei->vfs_inode;
 }
 
@@ -468,6 +474,8 @@ static void bdev_i_callback(struct rcu_h
 	struct inode *inode = container_of(head, struct inode, i_rcu);
 	struct bdev_inode *bdi = BDEV_I(inode);
 
+	percpu_free_rwsem(&bdi->bdev.bd_block_size_semaphore);
+
 	kmem_cache_free(bdev_cachep, bdi);
 }
 
@@ -491,7 +499,6 @@ static void init_once(void *foo)
 	inode_init_once(&ei->vfs_inode);
 	/* Initialize mutex for freeze. */
 	mutex_init(&bdev->bd_fsfreeze_mutex);
-	init_rwsem(&bdev->bd_block_size_semaphore);
 }
 
 static inline void __bd_forget(struct inode *inode)
@@ -1592,12 +1599,13 @@ ssize_t blkdev_aio_read(struct kiocb *io
 {
 	ssize_t ret;
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
+	percpu_rwsem_ptr p;
 
-	down_read(&bdev->bd_block_size_semaphore);
+	p = percpu_down_read(&bdev->bd_block_size_semaphore);
 
 	ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
 
-	up_read(&bdev->bd_block_size_semaphore);
+	percpu_up_read(&bdev->bd_block_size_semaphore, p);
 
 	return ret;
 }
@@ -1616,10 +1624,11 @@ ssize_t blkdev_aio_write(struct kiocb *i
 	struct file *file = iocb->ki_filp;
 	struct block_device *bdev = I_BDEV(file->f_mapping->host);
 	ssize_t ret;
+	percpu_rwsem_ptr p;
 
 	BUG_ON(iocb->ki_pos != pos);
 
-	down_read(&bdev->bd_block_size_semaphore);
+	p = percpu_down_read(&bdev->bd_block_size_semaphore);
 
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	if (ret > 0 || ret == -EIOCBQUEUED) {
@@ -1630,7 +1639,7 @@ ssize_t blkdev_aio_write(struct kiocb *i
 			ret = err;
 	}
 
-	up_read(&bdev->bd_block_size_semaphore);
+	percpu_up_read(&bdev->bd_block_size_semaphore, p);
 
 	return ret;
 }
@@ -1640,12 +1649,13 @@ int blkdev_mmap(struct file *file, struc
 {
 	int ret;
 	struct block_device *bdev = I_BDEV(file->f_mapping->host);
+	percpu_rwsem_ptr p;
 
-	down_read(&bdev->bd_block_size_semaphore);
+	p = percpu_down_read(&bdev->bd_block_size_semaphore);
 
 	ret = generic_file_mmap(file, vma);
 
-	up_read(&bdev->bd_block_size_semaphore);
+	percpu_up_read(&bdev->bd_block_size_semaphore, p);
 
 	return ret;
 }
Index: linux-3.5-fast/include/linux/fs.h
===================================================================
--- linux-3.5-fast.orig/include/linux/fs.h	2012-07-28 18:32:10.000000000 +0200
+++ linux-3.5-fast/include/linux/fs.h	2012-07-28 18:32:12.000000000 +0200
@@ -10,6 +10,7 @@
 #include <linux/ioctl.h>
 #include <linux/blk_types.h>
 #include <linux/types.h>
+#include <linux/percpu-rwsem.h>
 
 /*
  * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -714,7 +715,7 @@ struct block_device {
 	/* Mutex for freeze */
 	struct mutex		bd_fsfreeze_mutex;
 	/* A semaphore that prevents I/O while block size is being changed */
-	struct rw_semaphore	bd_block_size_semaphore;
+	struct percpu_rw_semaphore	bd_block_size_semaphore;
 };
 
 /*

Introduce percpu rw semaphores

When many CPUs are locking a rw semaphore for read concurrently, cache
line bouncing occurs. When a CPU acquires rw semaphore for read, the
CPU writes to the cache line holding the semaphore. Consequently, the
cache line is being moved between CPUs and this slows down semaphore
acquisition.

This patch introduces new percpu rw semaphores. They are functionally
identical to existing rw semaphores, but locking the percpu rw semaphore
for read is faster and locking for write is slower.

The percpu rw semaphore is implemented as a percpu array of rw
semaphores, one semaphore for each CPU. When some thread needs to lock
the semaphore for read, only the semaphore on the current CPU is locked
for read. When some thread needs to lock the semaphore for write, the
semaphores for all CPUs are locked for write. This avoids cache line
bouncing.

Note that the thread that is locking the percpu rw semaphore may be
rescheduled; this doesn't cause a bug, but cache line bouncing occurs in
this case.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 include/linux/percpu-rwsem.h |   77 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

Index: linux-3.5-fast/include/linux/percpu-rwsem.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-3.5-fast/include/linux/percpu-rwsem.h	2012-07-28 18:41:05.000000000 +0200
@@ -0,0 +1,77 @@
+#ifndef _LINUX_PERCPU_RWSEM_H
+#define _LINUX_PERCPU_RWSEM_H
+
+#include <linux/rwsem.h>
+#include <linux/percpu.h>
+
+#ifndef CONFIG_SMP
+
+#define percpu_rw_semaphore	rw_semaphore
+#define percpu_rwsem_ptr	int
+#define percpu_down_read(x)	(down_read(x), 0)
+#define percpu_up_read(x, y)	up_read(x)
+#define percpu_down_write	down_write
+#define percpu_up_write		up_write
+#define percpu_init_rwsem(x)	(({init_rwsem(x);}), 0)
+#define percpu_free_rwsem(x)	do { } while (0)
+
+#else
+
+struct percpu_rw_semaphore {
+	struct rw_semaphore __percpu *s;
+};
+
+typedef struct rw_semaphore *percpu_rwsem_ptr;
+
+static inline percpu_rwsem_ptr percpu_down_read(struct percpu_rw_semaphore *sem)
+{
+	struct rw_semaphore *s = __this_cpu_ptr(sem->s);
+	down_read(s);
+	return s;
+}
+
+static inline void percpu_up_read(struct percpu_rw_semaphore *sem, percpu_rwsem_ptr s)
+{
+	up_read(s);
+}
+
+static inline void percpu_down_write(struct percpu_rw_semaphore *sem)
+{
+	int cpu;
+	for_each_possible_cpu(cpu) {
+		struct rw_semaphore *s = per_cpu_ptr(sem->s, cpu);
+		down_write(s);
+	}
+}
+
+static inline void percpu_up_write(struct percpu_rw_semaphore *sem)
+{
+	int cpu;
+	for_each_possible_cpu(cpu) {
+		struct rw_semaphore *s = per_cpu_ptr(sem->s, cpu);
+		up_write(s);
+	}
+}
+
+static inline int percpu_init_rwsem(struct percpu_rw_semaphore *sem)
+{
+	int cpu;
+	sem->s = alloc_percpu(struct rw_semaphore);
+	if (unlikely(!sem->s))
+		return -ENOMEM;
+	for_each_possible_cpu(cpu) {
+		struct rw_semaphore *s = per_cpu_ptr(sem->s, cpu);
+		init_rwsem(s);
+	}
+	return 0;
+}
+
+static inline void percpu_free_rwsem(struct percpu_rw_semaphore *sem)
+{
+	free_percpu(sem->s);
+	sem->s = NULL;		/* catch use after free bugs */
+}
+
+#endif
+
+#endif


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 4/4] New percpu lock implementation
  2012-08-31 18:43                               ` [PATCH 3/4] blockdev: turn a rw semaphore into a percpu rw semaphore Mikulas Patocka
@ 2012-08-31 18:43                                 ` Mikulas Patocka
  0 siblings, 0 replies; 45+ messages in thread
From: Mikulas Patocka @ 2012-08-31 18:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel, linux-kernel,
	Jeff Moyer, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

New percpu lock implementation

An alternative percpu lock implementation. The original idea is by
Eric Dumazet <eric.dumazet@gmail.com>.

The lock consists of an array of percpu unsigned integers, a boolean
variable and a mutex.

When we take the lock for read, we enter an RCU read section and check
the "locked" variable. If it is false, we increase a percpu counter on
the current cpu and exit the RCU section. If "locked" is true, we exit
the RCU section, take the mutex and drop it (this waits until a writer
has finished) and retry.

Unlocking for read just decreases the percpu variable. Note that we can
unlock on a different cpu than where we locked; in this case the
counter underflows. The sum of all percpu counters represents the number
of processes that hold the lock for read.

When we need to lock for write, we take the mutex, set the "locked"
variable to true and synchronize RCU. Since RCU has been synchronized, no
processes can create new read locks. We wait until the sum of percpu
counters is zero - when it is, there are no readers in the critical
section.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 fs/block_dev.c               |   15 ++----
 include/linux/percpu-rwsem.h |   95 +++++++++++++++++++++----------------------
 2 files changed, 54 insertions(+), 56 deletions(-)

Index: linux-3.5.2-fast/fs/block_dev.c
===================================================================
--- linux-3.5.2-fast.orig/fs/block_dev.c	2012-08-18 01:23:32.000000000 +0200
+++ linux-3.5.2-fast/fs/block_dev.c	2012-08-18 01:23:32.000000000 +0200
@@ -1599,13 +1599,12 @@ ssize_t blkdev_aio_read(struct kiocb *io
 {
 	ssize_t ret;
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
-	percpu_rwsem_ptr p;
 
-	p = percpu_down_read(&bdev->bd_block_size_semaphore);
+	percpu_down_read(&bdev->bd_block_size_semaphore);
 
 	ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
 
-	percpu_up_read(&bdev->bd_block_size_semaphore, p);
+	percpu_up_read(&bdev->bd_block_size_semaphore);
 
 	return ret;
 }
@@ -1624,11 +1623,10 @@ ssize_t blkdev_aio_write(struct kiocb *i
 	struct file *file = iocb->ki_filp;
 	struct block_device *bdev = I_BDEV(file->f_mapping->host);
 	ssize_t ret;
-	percpu_rwsem_ptr p;
 
 	BUG_ON(iocb->ki_pos != pos);
 
-	p = percpu_down_read(&bdev->bd_block_size_semaphore);
+	percpu_down_read(&bdev->bd_block_size_semaphore);
 
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	if (ret > 0 || ret == -EIOCBQUEUED) {
@@ -1639,7 +1637,7 @@ ssize_t blkdev_aio_write(struct kiocb *i
 			ret = err;
 	}
 
-	percpu_up_read(&bdev->bd_block_size_semaphore, p);
+	percpu_up_read(&bdev->bd_block_size_semaphore);
 
 	return ret;
 }
@@ -1649,13 +1647,12 @@ int blkdev_mmap(struct file *file, struc
 {
 	int ret;
 	struct block_device *bdev = I_BDEV(file->f_mapping->host);
-	percpu_rwsem_ptr p;
 
-	p = percpu_down_read(&bdev->bd_block_size_semaphore);
+	percpu_down_read(&bdev->bd_block_size_semaphore);
 
 	ret = generic_file_mmap(file, vma);
 
-	percpu_up_read(&bdev->bd_block_size_semaphore, p);
+	percpu_up_read(&bdev->bd_block_size_semaphore);
 
 	return ret;
 }
Index: linux-3.5.2-fast/include/linux/percpu-rwsem.h
===================================================================
--- linux-3.5.2-fast.orig/include/linux/percpu-rwsem.h	2012-08-18 01:23:32.000000000 +0200
+++ linux-3.5.2-fast/include/linux/percpu-rwsem.h	2012-08-18 01:26:08.000000000 +0200
@@ -1,77 +1,78 @@
 #ifndef _LINUX_PERCPU_RWSEM_H
 #define _LINUX_PERCPU_RWSEM_H
 
-#include <linux/rwsem.h>
+#include <linux/mutex.h>
 #include <linux/percpu.h>
-
-#ifndef CONFIG_SMP
-
-#define percpu_rw_semaphore	rw_semaphore
-#define percpu_rwsem_ptr	int
-#define percpu_down_read(x)	(down_read(x), 0)
-#define percpu_up_read(x, y)	up_read(x)
-#define percpu_down_write	down_write
-#define percpu_up_write		up_write
-#define percpu_init_rwsem(x)	(({init_rwsem(x);}), 0)
-#define percpu_free_rwsem(x)	do { } while (0)
-
-#else
+#include <linux/rcupdate.h>
+#include <linux/delay.h>
 
 struct percpu_rw_semaphore {
-	struct rw_semaphore __percpu *s;
+	unsigned __percpu *counters;
+	bool locked;
+	struct mutex mtx;
 };
 
-typedef struct rw_semaphore *percpu_rwsem_ptr;
-
-static inline percpu_rwsem_ptr percpu_down_read(struct percpu_rw_semaphore *sem)
+static inline void percpu_down_read(struct percpu_rw_semaphore *p)
 {
-	struct rw_semaphore *s = __this_cpu_ptr(sem->s);
-	down_read(s);
-	return s;
+	rcu_read_lock();
+	if (unlikely(p->locked)) {
+		rcu_read_unlock();
+		mutex_lock(&p->mtx);
+		this_cpu_inc(*p->counters);
+		mutex_unlock(&p->mtx);
+		return;
+	}
+	this_cpu_inc(*p->counters);
+	rcu_read_unlock();
 }
 
-static inline void percpu_up_read(struct percpu_rw_semaphore *sem, percpu_rwsem_ptr s)
+static inline void percpu_up_read(struct percpu_rw_semaphore *p)
 {
-	up_read(s);
+	smp_wmb();
+	this_cpu_dec(*p->counters);
 }
 
-static inline void percpu_down_write(struct percpu_rw_semaphore *sem)
+static inline unsigned int percpu_count(unsigned __percpu *counters)
 {
+	unsigned total = 0;
 	int cpu;
-	for_each_possible_cpu(cpu) {
-		struct rw_semaphore *s = per_cpu_ptr(sem->s, cpu);
-		down_write(s);
-	}
+
+	for_each_possible_cpu(cpu)
+		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
+
+	return total;
 }
 
-static inline void percpu_up_write(struct percpu_rw_semaphore *sem)
+static inline void percpu_down_write(struct percpu_rw_semaphore *p)
 {
-	int cpu;
-	for_each_possible_cpu(cpu) {
-		struct rw_semaphore *s = per_cpu_ptr(sem->s, cpu);
-		up_write(s);
-	}
+	mutex_lock(&p->mtx);
+	p->locked = true;
+	synchronize_rcu();
+	while (percpu_count(p->counters))
+		msleep(1);
+	smp_rmb(); /* paired with smp_wmb() in percpu_up_read() */
 }
 
-static inline int percpu_init_rwsem(struct percpu_rw_semaphore *sem)
+static inline void percpu_up_write(struct percpu_rw_semaphore *p)
 {
-	int cpu;
-	sem->s = alloc_percpu(struct rw_semaphore);
-	if (unlikely(!sem->s))
+	p->locked = false;
+	mutex_unlock(&p->mtx);
+}
+
+static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
+{
+	p->counters = alloc_percpu(unsigned);
+	if (unlikely(!p->counters))
 		return -ENOMEM;
-	for_each_possible_cpu(cpu) {
-		struct rw_semaphore *s = per_cpu_ptr(sem->s, cpu);
-		init_rwsem(s);
-	}
+	p->locked = false;
+	mutex_init(&p->mtx);
 	return 0;
 }
 
-static inline void percpu_free_rwsem(struct percpu_rw_semaphore *sem)
+static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
 {
-	free_percpu(sem->s);
-	sem->s = NULL;		/* catch use after free bugs */
+	free_percpu(p->counters);
+	p->counters = NULL; /* catch use after free bugs */
 }
 
 #endif
-
-#endif


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
  2012-08-31 18:40                         ` [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time Mikulas Patocka
  2012-08-31 18:41                           ` [PATCH 1/4] Add a lock that will be needed by the next patch Mikulas Patocka
@ 2012-08-31 19:27                           ` Mikulas Patocka
  2012-08-31 20:11                             ` Jeff Moyer
  1 sibling, 1 reply; 45+ messages in thread
From: Mikulas Patocka @ 2012-08-31 19:27 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel, linux-kernel,
	Jeff Moyer, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon



On Fri, 31 Aug 2012, Mikulas Patocka wrote:

> Hi
> 
> This is a series of patches to prevent a crash when someone is 
> reading a block device and the block size is changed simultaneously. 
> (The crash is already happening in a production environment.)
> 
> The first patch adds a rw-lock to struct block_device, but doesn't use the 
> lock anywhere. The reason why I submit this as a separate patch is that on 
> my computer adding an unused field to this structure affects performance 
> much more than any locking changes.
> 
> The second patch uses the rw-lock. The lock is locked for read when doing 
> I/O on the block device and it is locked for write when changing block 
> size.
> 
> The third patch converts the rw-lock to a percpu rw-lock for better 
> performance, to avoid cache line bouncing.
> 
> The fourth patch is an alternate percpu rw-lock implementation using RCU 
> by Eric Dumazet. It avoids any atomic instruction in the hot path.
> 
> Mikulas

I tested performance of patches. I created 4GB ramdisk, I initially filled 
it with zeros (so that ramdisk allocation-on-demand doesn't affect the 
results).

I ran fio to perform 8 concurrent accesses on 8 core machine (two 
Barcelona Opterons):
time fio --rw=randrw --size=4G --bs=512 --filename=/dev/ram0 --direct=1 
--name=job1 --name=job2 --name=job3 --name=job4 --name=job5 --name=job6 
--name=job7 --name=job8

The results actually show that the size of struct block_device and the 
alignment of subsequent fields in struct inode have a far greater effect 
on the results than the type of locking used. (struct inode is placed just 
after struct block_device in "struct bdev_inode" in fs/block_dev.c.)

plain kernel 3.5.3: 57.9s
patch 1: 43.4s
patches 1,2: 43.7s
patches 1,2,3: 38.5s
patches 1,2,3,4: 58.6s

You can see that patch 1 improves the time by 14.5 seconds, but all that 
patch 1 does is add an unused structure field.

Patch 3 is 4.9 seconds faster than patch 1, although patch 1 does no 
locking at all and patch 3 does per-cpu locking. So, the reason for the 
speedup is the different sizeof of struct block_device (and, consequently, 
the different alignment of struct inode), rather than any locking improvement.

I would be interested if other people did performance testing of the 
patches too.

Mikulas

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
  2012-08-31 19:27                           ` [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time Mikulas Patocka
@ 2012-08-31 20:11                             ` Jeff Moyer
  2012-08-31 20:34                               ` Mikulas Patocka
  2012-09-17 21:19                               ` Jeff Moyer
  0 siblings, 2 replies; 45+ messages in thread
From: Jeff Moyer @ 2012-08-31 20:11 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Eric Dumazet, Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

Mikulas Patocka <mpatocka@redhat.com> writes:

> On Fri, 31 Aug 2012, Mikulas Patocka wrote:
>
>> Hi
>> 
>> This is a series of patches to prevent a crash when someone is 
>> reading a block device and the block size is changed simultaneously. 
>> (The crash is already happening in a production environment.)
>> 
>> The first patch adds a rw-lock to struct block_device, but doesn't use the 
>> lock anywhere. The reason why I submit this as a separate patch is that on 
>> my computer adding an unused field to this structure affects performance 
>> much more than any locking changes.
>> 
>> The second patch uses the rw-lock. The lock is locked for read when doing 
>> I/O on the block device and it is locked for write when changing block 
>> size.
>> 
>> The third patch converts the rw-lock to a percpu rw-lock for better 
>> performance, to avoid cache line bouncing.
>> 
>> The fourth patch is an alternate percpu rw-lock implementation using RCU 
>> by Eric Dumazet. It avoids any atomic instruction in the hot path.
>> 
>> Mikulas
>
> I tested performance of patches. I created 4GB ramdisk, I initially filled 
> it with zeros (so that ramdisk allocation-on-demand doesn't affect the 
> results).
>
> I ran fio to perform 8 concurrent accesses on 8 core machine (two 
> Barcelona Opterons):
> time fio --rw=randrw --size=4G --bs=512 --filename=/dev/ram0 --direct=1 
> --name=job1 --name=job2 --name=job3 --name=job4 --name=job5 --name=job6 
> --name=job7 --name=job8
>
> The results actually show that the size of struct block_device and the 
> alignment of subsequent fields in struct inode have a far greater effect 
> on the results than the type of locking used. (struct inode is placed just 
> after struct block_device in "struct bdev_inode" in fs/block_dev.c.)
>
> plain kernel 3.5.3: 57.9s
> patch 1: 43.4s
> patches 1,2: 43.7s
> patches 1,2,3: 38.5s
> patches 1,2,3,4: 58.6s
>
> You can see that patch 1 improves the time by 14.5 seconds, but all that 
> patch 1 does is add an unused structure field.
>
> Patch 3 is 4.9 seconds faster than patch 1, although patch 1 does no 
> locking at all and patch 3 does per-cpu locking. So, the reason for the 
> speedup is the different sizeof of struct block_device (and, consequently, 
> the different alignment of struct inode), rather than any locking improvement.

How many runs did you do?  Did you see much run to run variation?

> I would be interested if other people did performance testing of the 
> patches too.

I'll do some testing next week, but don't expect to get to it before
Wednesday.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
  2012-08-31 20:11                             ` Jeff Moyer
@ 2012-08-31 20:34                               ` Mikulas Patocka
  2012-09-17 21:19                               ` Jeff Moyer
  1 sibling, 0 replies; 45+ messages in thread
From: Mikulas Patocka @ 2012-08-31 20:34 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Eric Dumazet, Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon



On Fri, 31 Aug 2012, Jeff Moyer wrote:

> Mikulas Patocka <mpatocka@redhat.com> writes:
> 
> > On Fri, 31 Aug 2012, Mikulas Patocka wrote:
> >
> >> Hi
> >> 
> >> This is a series of patches to prevent a crash when someone is 
> >> reading a block device and the block size is changed simultaneously. 
> >> (The crash is already happening in a production environment.)
> >> 
> >> The first patch adds a rw-lock to struct block_device, but doesn't use the 
> >> lock anywhere. The reason why I submit this as a separate patch is that on 
> >> my computer adding an unused field to this structure affects performance 
> >> much more than any locking changes.
> >> 
> >> The second patch uses the rw-lock. The lock is locked for read when doing 
> >> I/O on the block device and it is locked for write when changing block 
> >> size.
> >> 
> >> The third patch converts the rw-lock to a percpu rw-lock for better 
> >> performance, to avoid cache line bouncing.
> >> 
> >> The fourth patch is an alternate percpu rw-lock implementation using RCU 
> >> by Eric Dumazet. It avoids any atomic instruction in the hot path.
> >> 
> >> Mikulas
> >
> > I tested performance of patches. I created 4GB ramdisk, I initially filled 
> > it with zeros (so that ramdisk allocation-on-demand doesn't affect the 
> > results).
> >
> > I ran fio to perform 8 concurrent accesses on 8 core machine (two 
> > Barcelona Opterons):
> > time fio --rw=randrw --size=4G --bs=512 --filename=/dev/ram0 --direct=1 
> > --name=job1 --name=job2 --name=job3 --name=job4 --name=job5 --name=job6 
> > --name=job7 --name=job8
> >
> > The results actually show that the size of struct block_device and the 
> > alignment of subsequent fields in struct inode have a far greater effect 
> > on the results than the type of locking used. (struct inode is placed just 
> > after struct block_device in "struct bdev_inode" in fs/block_dev.c.)
> >
> > plain kernel 3.5.3: 57.9s
> > patch 1: 43.4s
> > patches 1,2: 43.7s
> > patches 1,2,3: 38.5s
> > patches 1,2,3,4: 58.6s
> >
> > You can see that patch 1 improves the time by 14.5 seconds, but all that 
> > patch 1 does is add an unused structure field.
> > 
> > Patch 3 is 4.9 seconds faster than patch 1, although patch 1 does no 
> > locking at all and patch 3 does per-cpu locking. So, the reason for the 
> > speedup is the different sizeof of struct block_device (and, consequently, 
> > the different alignment of struct inode), rather than any locking improvement.
> 
> How many runs did you do?  Did you see much run to run variation?

These results come from two runs (which differed by no more than 1s), but 
I observed the same phenomenon - a difference in time due to the size of 
struct block_device - many times before, while benchmarking during the 
development of these patches.

I actually had to apply something like this to make the results not depend 
on the size of struct block_device.

I would be interested if the same difference could be observed on other 
processors or if it is something specific to the AMD K10 architecture.

---
 fs/block_dev.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Index: linux-3.5.3-fast/fs/block_dev.c
===================================================================
--- linux-3.5.3-fast.orig/fs/block_dev.c	2012-08-31 22:30:07.000000000 +0200
+++ linux-3.5.3-fast/fs/block_dev.c	2012-08-31 22:30:43.000000000 +0200
@@ -31,7 +31,10 @@
 #include "internal.h"
 
 struct bdev_inode {
-	struct block_device bdev;
+	union {
+		struct block_device bdev;
+		char pad[0x140];
+	};
 	struct inode vfs_inode;
 };
 
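
(To check that the union really pins vfs_inode at the intended offset, a
compile-time assertion along these lines could be dropped into the
existing bdev_cache_init() - just a sketch, with 0x140 being the pad
size used above:

	BUILD_BUG_ON(offsetof(struct bdev_inode, vfs_inode) != 0x140);

"pahole fs/block_dev.o" shows the resulting structure layout as well.)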

> > I would be interested if other people did performance testing of the 
> > patches too.
> 
> I'll do some testing next week, but don't expect to get to it before
> Wednesday.
> 
> Cheers,
> Jeff

Mikulas

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
  2012-08-31 20:11                             ` Jeff Moyer
  2012-08-31 20:34                               ` Mikulas Patocka
@ 2012-09-17 21:19                               ` Jeff Moyer
  2012-09-18 17:04                                 ` Mikulas Patocka
  1 sibling, 1 reply; 45+ messages in thread
From: Jeff Moyer @ 2012-09-17 21:19 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Eric Dumazet, Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

[-- Attachment #1: Type: text/plain, Size: 15768 bytes --]

Jeff Moyer <jmoyer@redhat.com> writes:
> Mikulas Patocka <mpatocka@redhat.com> writes:
>> I would be interested if other people did performance testing of the 
>> patches too.
>
> I'll do some testing next week, but don't expect to get to it before
> Wednesday.

Sorry for taking so long on this.  I managed to get access to an 80-CPU
(160-thread) system with 1TB of memory.  I installed a PCIe SSD into
this machine and did some testing against the raw block device.

I've attached the fio job file I used.  Basically, I tested sequential
reads, sequential writes, random reads, random writes, and then a mix of
sequential reads and writes, and a mix of random reads and writes.  All
tests used direct I/O to the block device, and each number shown is an
average of 5 runs.  I had to pin the fio processes to the same NUMA node
as the PCIe adapter in order to get low run-to-run variations.  Because
of the NUMA factor, I was unable to get reliable results running
processes against all of the 160 threads on the system.  The runs below
have 4 processes, each pushing a queue depth of 1024.
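
(For anyone reproducing this: pinning of this kind can be done with
numactl, e.g. "numactl --cpunodebind=N --membind=N fio job.fio" with N
being the node the PCIe adapter is attached to - the node number is
machine-specific.)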

So, on to the results.  I haven't fully investigated them yet, but I
plan to as they are rather surprising.

The first patch in the series simply adds a semaphore to the
block_device structure.  Mikulas, you had mentioned that this managed to
have a large effect on your test load.  In my case, this didn't seem to
make any difference at all:

      3.6.0-rc5+-job.fio-run2/output-avg
          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0  748522 187130 44864 16.34 60.65 3799440.00
     read1	 690615 172653 48602       0      0     0 13.45 61.42 4044720.00
randwrite1	      0      0     0  716406 179101 46839 29.03 52.79 3151140.00
 randread1	 683466 170866 49108       0      0     0 25.92 54.67 3081610.00
readwrite1	 377518  94379 44450  377645  94410 44450 15.49 64.32 3139240.00
   randrw1	 355815  88953 47178  355733  88933 47178 27.96 54.24 2944570.00
3.6.0-rc5.mikulas.1+-job.fio-run2/output-avg
          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0  764037 191009 43925 17.14 60.15 3536950.00
     read1	 696880 174220 48152       0      0     0 13.90 61.74 3710168.00
randwrite1	      0      0     0  737331 184332 45511 29.82 52.71 2869440.00
 randread1	 689319 172329 48684       0      0     0 26.38 54.58 2927411.00
readwrite1	 387651  96912 43294  387799  96949 43294 16.06 64.92 2814340.00
   randrw1	 360298  90074 46591  360304  90075 46591 28.53 54.10 2793120.00
                                   %diff
          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0       0      0     0 0.00 0.00  -6.91
     read1	      0      0     0       0      0     0 0.00 0.00  -8.27
randwrite1	      0      0     0       0      0     0 0.00 0.00  -8.94
 randread1	      0      0     0       0      0     0 0.00 0.00  -5.00
readwrite1	      0      0     0       0      0     0 0.00 0.00 -10.35
   randrw1	      0      0     0       0      0     0 0.00 0.00  -5.14


The headings are:
BW = bandwidth in KB/s
IOPS = I/Os per second
msec = number of milliseconds the run took (smaller is better)
usr = %user time
sys = %system time
csw = context switches

The first two tables show the results of each run.  In this case, the
first is the unpatched kernel, and the second is the one with the
block_device structure change.  The third table is the % difference
between the two.  A positive number indicates the second run had a
larger average than the first.  I found that the context switch rate was
rather unpredictable, so I really should have just left that out of the
reporting.
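
(A worked example for reading the %diff tables: in the next comparison,
read1 bandwidth goes from 696880 to 749437 KB/s between the .1+ and .2+
kernels, and (749437 - 696880) / 696880 = +7.5%, reported as 7;
differences of only a couple of percent appear to be reported as 0.)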

As you can see, adding a member to struct block_device did not really
change the results.


Next up is the patch that actually uses the rw semaphore to protect
access to the block size.  Here are the results:

3.6.0-rc5.mikulas.1+-job.fio-run2/output-avg
          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0  764037 191009 43925 17.14 60.15 3536950.00
     read1	 696880 174220 48152       0      0     0 13.90 61.74 3710168.00
randwrite1	      0      0     0  737331 184332 45511 29.82 52.71 2869440.00
 randread1	 689319 172329 48684       0      0     0 26.38 54.58 2927411.00
readwrite1	 387651  96912 43294  387799  96949 43294 16.06 64.92 2814340.00
   randrw1	 360298  90074 46591  360304  90075 46591 28.53 54.10 2793120.00
3.6.0-rc5.mikulas.2+-job.fio-run2/output-avg
          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0  816713 204178 41108 16.60 62.06 3159574.00
     read1	 749437 187359 44800       0      0     0 13.91 63.69 3190050.00
randwrite1	      0      0     0  747534 186883 44941 29.96 53.23 2617590.00
 randread1	 734627 183656 45699       0      0     0 27.02 56.27 2403191.00
readwrite1	 396113  99027 42397  396120  99029 42397 14.50 63.21 3460140.00
   randrw1	 374408  93601 44806  374556  93638 44806 28.46 54.33 2688985.00
                                   %diff
          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0       6      6    -6 0.00 0.00 -10.67
     read1	      7      7    -6       0      0     0 0.00 0.00 -14.02
randwrite1	      0      0     0       0      0     0 0.00 0.00  -8.78
 randread1	      6      6    -6       0      0     0 0.00 0.00 -17.91
readwrite1	      0      0     0       0      0     0 -9.71 0.00  22.95
   randrw1	      0      0     0       0      0     0 0.00 0.00   0.00

As you can see, there were modest gains in write, read, and randread.
This is somewhat unexpected, as you would think that introducing locking
would not *help* performance!  Investigating the standard deviations for
each set of 5 runs shows that the performance difference is significant
(the standard deviation is reported as a percentage of the average):

This is a table of standard deviations for the 5 runs comprising the
above average with this kernel: 3.6.0-rc5.mikulas.1+

          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0       1      1     1 2.99 1.27   9.10
     read1	      0      0     0       0      0     0 2.12 0.53   5.03
randwrite1	      0      0     0       0      0     0 1.25 0.49   5.52
 randread1	      1      1     1       0      0     0 1.81 1.18  10.04
readwrite1	      2      2     2       2      2     2 11.35 1.86  26.83
   randrw1	      2      2     2       2      2     2 4.01 2.71  22.72

And here are the standard deviations for the .2+ kernel:

          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0       2      2     2 3.33 2.95   7.88
     read1	      2      2     2       0      0     0 6.44 2.30  19.27
randwrite1	      0      0     0       3      3     3 0.18 0.52   1.71
 randread1	      2      2     2       0      0     0 3.72 2.34  23.70
readwrite1	      3      3     3       3      3     3 3.35 2.61   7.38
   randrw1	      1      1     1       1      1     1 1.80 1.00   9.73


Next, we'll move on to the third patch in the series, which converts the
rw semaphore to a per-cpu semaphore.

3.6.0-rc5.mikulas.2+-job.fio-run2/output-avg
          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0  816713 204178 41108 16.60 62.06 3159574.00
     read1	 749437 187359 44800       0      0     0 13.91 63.69 3190050.00
randwrite1	      0      0     0  747534 186883 44941 29.96 53.23 2617590.00
 randread1	 734627 183656 45699       0      0     0 27.02 56.27 2403191.00
readwrite1	 396113  99027 42397  396120  99029 42397 14.50 63.21 3460140.00
   randrw1	 374408  93601 44806  374556  93638 44806 28.46 54.33 2688985.00
3.6.0-rc5.mikulas.3+-job.fio-run2/output-avg
          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0  870892 217723 38528 17.83 41.57 1697870.00
     read1	1430164 357541 23462       0      0     0 14.41 56.00 241315.00
randwrite1	      0      0     0  759789 189947 44163 31.48 36.36 1256040.00
 randread1	1043830 260958 32146       0      0     0 31.89 44.39 185032.00
readwrite1	 692567 173141 24226  692489 173122 24226 18.65 53.64 311255.00
   randrw1	 501208 125302 33469  501446 125361 33469 35.40 41.61 246391.00
                                   %diff
          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0       6      6    -6 7.41 -33.02 -46.26
     read1	     90     90   -47       0      0     0 0.00 -12.07 -92.44
randwrite1	      0      0     0       0      0     0 5.07 -31.69 -52.02
 randread1	     42     42   -29       0      0     0 18.02 -21.11 -92.30
readwrite1	     74     74   -42      74     74   -42 28.62 -15.14 -91.00
   randrw1	     33     33   -25      33     33   -25 24.39 -23.41 -90.84

Wow!  Switching to the per-cpu semaphore implementation just boosted the
performance of the I/O path big-time.  Note that the system time also
goes down!  So, we get better throughput and less system time.  This
sounds too good to be true.  ;-)  Here are the standard deviations
(again, shown as percentages) for the .3+ kernel:

          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0       0      0     0 0.96 0.19   1.03
     read1	      0      0     0       0      0     0 1.82 0.24   2.46
randwrite1	      0      0     0       0      0     0 0.40 0.39   0.68
 randread1	      0      0     0       0      0     0 0.53 0.31   2.02
readwrite1	      0      0     0       0      0     0 2.73 4.07  33.27
   randrw1	      1      1     1       1      1     1 0.40 0.10   3.29

Again, there's no slop there, so the results are very reproducible.

Finally, the last patch changes to an RCU-based rw semaphore
implementation.  Here are the results for that, as compared with the
previous kernel:

3.6.0-rc5.mikulas.3+-job.fio-run2/output-avg
          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0  870892 217723 38528 17.83 41.57 1697870.00
     read1	1430164 357541 23462       0      0     0 14.41 56.00 241315.00
randwrite1	      0      0     0  759789 189947 44163 31.48 36.36 1256040.00
 randread1	1043830 260958 32146       0      0     0 31.89 44.39 185032.00
readwrite1	 692567 173141 24226  692489 173122 24226 18.65 53.64 311255.00
   randrw1	 501208 125302 33469  501446 125361 33469 35.40 41.61 246391.00
3.6.0-rc5.mikulas.4+-job.fio-run2/output-avg
          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0  812659 203164 41309 16.80 61.71 3208620.00
     read1	 739061 184765 45442       0      0     0 14.32 62.85 3375484.00
randwrite1	      0      0     0  726971 181742 46192 30.00 52.33 2736270.00
 randread1	 719040 179760 46683       0      0     0 26.47 54.78 2914080.00
readwrite1	 396670  99167 42309  396619  99154 42309 14.91 63.12 3412220.00
   randrw1	 374790  93697 44766  374807  93701 44766 28.42 54.10 2774690.00
                                   %diff
          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0      -6     -6     7 -5.78 48.45  88.98
     read1	    -48    -48    93       0      0     0 0.00 12.23 1298.79
randwrite1	      0      0     0       0      0     0 0.00 43.92 117.85
 randread1	    -31    -31    45       0      0     0 -17.00 23.41 1474.91
readwrite1	    -42    -42    74     -42    -42    74 -20.05 17.67 996.28
   randrw1	    -25    -25    33     -25    -25    33 -19.72 30.02 1026.13

And we've lost a good bit of performance!  Talk about
counter-intuitive.  Here are the standard deviation numbers:

          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0       2      2     2 2.96 3.00   6.79
     read1	      3      3     3       0      0     0 6.52 2.82  21.86
randwrite1	      0      0     0       2      2     2 0.71 0.55   4.07
 randread1	      1      1     1       0      0     0 4.13 2.31  20.12
readwrite1	      1      1     1       1      1     1 4.14 2.64   6.12
   randrw1	      0      0     0       0      0     0 0.59 0.25   2.99


Here is a comparison of the vanilla kernel versus the best performing
patch in this series (patch 3 of 4):

      3.6.0-rc5+-job.fio-run2/output-avg
          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0  748522 187130 44864 16.34 60.65 3799440.00
     read1	 690615 172653 48602       0      0     0 13.45 61.42 4044720.00
randwrite1	      0      0     0  716406 179101 46839 29.03 52.79 3151140.00
 randread1	 683466 170866 49108       0      0     0 25.92 54.67 3081610.00
readwrite1	 377518  94379 44450  377645  94410 44450 15.49 64.32 3139240.00
   randrw1	 355815  88953 47178  355733  88933 47178 27.96 54.24 2944570.00
3.6.0-rc5.mikulas.3+-job.fio-run2/output-avg
          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0  870892 217723 38528 17.83 41.57 1697870.00
     read1	1430164 357541 23462       0      0     0 14.41 56.00 241315.00
randwrite1	      0      0     0  759789 189947 44163 31.48 36.36 1256040.00
 randread1	1043830 260958 32146       0      0     0 31.89 44.39 185032.00
readwrite1	 692567 173141 24226  692489 173122 24226 18.65 53.64 311255.00
   randrw1	 501208 125302 33469  501446 125361 33469 35.40 41.61 246391.00
                                   %diff
          	        READ                WRITE                 CPU          
  Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
    write1	      0      0     0      16     16   -14 9.12 -31.46 -55.31
     read1	    107    107   -51       0      0     0 7.14 -8.82 -94.03
randwrite1	      0      0     0       6      6    -5 8.44 -31.12 -60.14
 randread1	     52     52   -34       0      0     0 23.03 -18.80 -94.00
readwrite1	     83     83   -45      83     83   -45 20.40 -16.60 -90.09
   randrw1	     40     40   -29      40     40   -29 26.61 -23.29 -91.63


Next up, I'm going to get some perf and blktrace data from these runs to
see if I can identify why there is such a drastic change in
performance.  I will also attempt to run the tests against a different
vendor's adapter, and maybe against some FC storage if I can set that up.

Cheers,
Jeff


[-- Attachment #2: job.fio --]
[-- Type: text/plain, Size: 883 bytes --]

[global]
ioengine=libaio
direct=1
iodepth=1024
iodepth_batch=32
iodepth_batch_complete=1
blocksize=4k
filename=/dev/XXX
size=8g
group_reporting=1
readwrite=write

[write1]
offset=0

[write2]
offset=8g

[write3]
offset=16g

[write4]
offset=24g

[global]
readwrite=read

[read1]
stonewall
offset=0

[read2]
offset=8g

[read3]
offset=16g

[read4]
offset=24g

[global]
readwrite=randwrite

[randwrite1]
stonewall
offset=0

[randwrite2]
offset=8g

[randwrite3]
offset=16g

[randwrite4]
offset=24g

[global]
readwrite=randread

[randread1]
stonewall
offset=0

[randread2]
offset=8g

[randread3]
offset=16g

[randread4]
offset=24g

[global]
readwrite=readwrite

[readwrite1]
stonewall
offset=0

[readwrite2]
offset=8g

[readwrite3]
offset=16g

[readwrite4]
offset=24g

[global]
readwrite=randrw

[randrw1]
stonewall
offset=0

[randrw2]
offset=8g

[randrw3]
offset=16g

[randrw4]
offset=24g

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
  2012-09-17 21:19                               ` Jeff Moyer
@ 2012-09-18 17:04                                 ` Mikulas Patocka
  2012-09-18 17:22                                   ` Jeff Moyer
  2012-09-18 20:11                                   ` Jeff Moyer
  0 siblings, 2 replies; 45+ messages in thread
From: Mikulas Patocka @ 2012-09-18 17:04 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Eric Dumazet, Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

Hi Jeff

Thanks for testing.

It would be interesting to see what happens if you take patch 3, leave 
"struct percpu_rw_semaphore bd_block_size_semaphore" in "struct 
block_device", but remove any use of the semaphore from fs/block_dev.c. 
Will the performance be like the unpatched kernel or like patch 3? It 
could be that the change in alignment affects performance on your CPU 
too, just differently than on my CPU.

What is the CPU model that you used for testing?

Mikulas


On Mon, 17 Sep 2012, Jeff Moyer wrote:

> Jeff Moyer <jmoyer@redhat.com> writes:
> > Mikulas Patocka <mpatocka@redhat.com> writes:
> >> I would be interested if other people did performance testing of the 
> >> patches too.
> >
> > I'll do some testing next week, but don't expect to get to it before
> > Wednesday.
> 
> Sorry for taking so long on this.  I managed to get access to an 80-CPU
> (160-thread) system with 1TB of memory.  I installed a PCIe SSD into
> this machine and did some testing against the raw block device.
> 
> I've attached the fio job file I used.  Basically, I tested sequential
> reads, sequential writes, random reads, random writes, and then a mix of
> sequential reads and writes, and a mix of random reads and writes.  All
> tests used direct I/O to the block device, and each number shown is an
> average of 5 runs.  I had to pin the fio processes to the same NUMA node
> as the PCIe adapter in order to get low run-to-run variations.  Because
> of the NUMA factor, I was unable to get reliable results running
> processes against all of the 160 threads on the system.  The runs below
> have 4 processes, each pushing a queue depth of 1024.
> 
> So, on to the results.  I haven't fully investigated them yet, but I
> plan to as they are rather surprising.
> 
> The first patch in the series simply adds a semaphore to the
> block_device structure.  Mikulas, you had mentioned that this managed to
> have a large effect on your test load.  In my case, this didn't seem to
> make any difference at all:
> 
>       3.6.0-rc5+-job.fio-run2/output-avg
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0  748522 187130 44864 16.34 60.65 3799440.00
>      read1	 690615 172653 48602       0      0     0 13.45 61.42 4044720.00
> randwrite1	      0      0     0  716406 179101 46839 29.03 52.79 3151140.00
>  randread1	 683466 170866 49108       0      0     0 25.92 54.67 3081610.00
> readwrite1	 377518  94379 44450  377645  94410 44450 15.49 64.32 3139240.00
>    randrw1	 355815  88953 47178  355733  88933 47178 27.96 54.24 2944570.00
> 3.6.0-rc5.mikulas.1+-job.fio-run2/output-avg
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0  764037 191009 43925 17.14 60.15 3536950.00
>      read1	 696880 174220 48152       0      0     0 13.90 61.74 3710168.00
> randwrite1	      0      0     0  737331 184332 45511 29.82 52.71 2869440.00
>  randread1	 689319 172329 48684       0      0     0 26.38 54.58 2927411.00
> readwrite1	 387651  96912 43294  387799  96949 43294 16.06 64.92 2814340.00
>    randrw1	 360298  90074 46591  360304  90075 46591 28.53 54.10 2793120.00
>                                    %diff
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0       0      0     0 0.00 0.00  -6.91
>      read1	      0      0     0       0      0     0 0.00 0.00  -8.27
> randwrite1	      0      0     0       0      0     0 0.00 0.00  -8.94
>  randread1	      0      0     0       0      0     0 0.00 0.00  -5.00
> readwrite1	      0      0     0       0      0     0 0.00 0.00 -10.35
>    randrw1	      0      0     0       0      0     0 0.00 0.00  -5.14
> 
> 
> The headings are:
> BW = bandwidth in KB/s
> IOPS = I/Os per second
> msec = number of milliseconds the run took (smaller is better)
> usr = %user time
> sys = %system time
> csw = context switches
> 
> The first two tables show the results of each run.  In this case, the
> first is the unpatched kernel, and the second is the one with the
> block_device structure change.  The third table is the % difference
> between the two.  A positive number indicates the second run had a
> larger average than the first.  I found that the context switch rate was
> rather unpredictable, so I really should have just left that out of the
> reporting.
> 
> As you can see, adding a member to struct block_device did not really
> change the results.
> 
> 
> Next up is the patch that actually uses the rw semaphore to protect
> access to the block size.  Here are the results:
> 
> 3.6.0-rc5.mikulas.1+-job.fio-run2/output-avg
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0  764037 191009 43925 17.14 60.15 3536950.00
>      read1	 696880 174220 48152       0      0     0 13.90 61.74 3710168.00
> randwrite1	      0      0     0  737331 184332 45511 29.82 52.71 2869440.00
>  randread1	 689319 172329 48684       0      0     0 26.38 54.58 2927411.00
> readwrite1	 387651  96912 43294  387799  96949 43294 16.06 64.92 2814340.00
>    randrw1	 360298  90074 46591  360304  90075 46591 28.53 54.10 2793120.00
> 3.6.0-rc5.mikulas.2+-job.fio-run2/output-avg
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0  816713 204178 41108 16.60 62.06 3159574.00
>      read1	 749437 187359 44800       0      0     0 13.91 63.69 3190050.00
> randwrite1	      0      0     0  747534 186883 44941 29.96 53.23 2617590.00
>  randread1	 734627 183656 45699       0      0     0 27.02 56.27 2403191.00
> readwrite1	 396113  99027 42397  396120  99029 42397 14.50 63.21 3460140.00
>    randrw1	 374408  93601 44806  374556  93638 44806 28.46 54.33 2688985.00
>                                    %diff
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0       6      6    -6 0.00 0.00 -10.67
>      read1	      7      7    -6       0      0     0 0.00 0.00 -14.02
> randwrite1	      0      0     0       0      0     0 0.00 0.00  -8.78
>  randread1	      6      6    -6       0      0     0 0.00 0.00 -17.91
> readwrite1	      0      0     0       0      0     0 -9.71 0.00  22.95
>    randrw1	      0      0     0       0      0     0 0.00 0.00   0.00
> 
> As you can see, there were modest gains in write, read, and randread.
> This is somewhat unexpected, as you would think that introducing locking
> would not *help* performance!  Investigating the standard deviations for
> each set of 5 runs shows that the performance difference is significant
> (the standard deviation is reported as a percentage of the average):
> 
> This is a table of standard deviations for the 5 runs comprising the
> above average with this kernel: 3.6.0-rc5.mikulas.1+
> 
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0       1      1     1 2.99 1.27   9.10
>      read1	      0      0     0       0      0     0 2.12 0.53   5.03
> randwrite1	      0      0     0       0      0     0 1.25 0.49   5.52
>  randread1	      1      1     1       0      0     0 1.81 1.18  10.04
> readwrite1	      2      2     2       2      2     2 11.35 1.86  26.83
>    randrw1	      2      2     2       2      2     2 4.01 2.71  22.72
> 
> And here are the standard deviations for the .2+ kernel:
> 
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0       2      2     2 3.33 2.95   7.88
>      read1	      2      2     2       0      0     0 6.44 2.30  19.27
> randwrite1	      0      0     0       3      3     3 0.18 0.52   1.71
>  randread1	      2      2     2       0      0     0 3.72 2.34  23.70
> readwrite1	      3      3     3       3      3     3 3.35 2.61   7.38
>    randrw1	      1      1     1       1      1     1 1.80 1.00   9.73
> 
> 
> Next, we'll move on to the third patch in the series, which converts the
> rw semaphore to a per-cpu semaphore.
> 
> 3.6.0-rc5.mikulas.2+-job.fio-run2/output-avg
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0  816713 204178 41108 16.60 62.06 3159574.00
>      read1	 749437 187359 44800       0      0     0 13.91 63.69 3190050.00
> randwrite1	      0      0     0  747534 186883 44941 29.96 53.23 2617590.00
>  randread1	 734627 183656 45699       0      0     0 27.02 56.27 2403191.00
> readwrite1	 396113  99027 42397  396120  99029 42397 14.50 63.21 3460140.00
>    randrw1	 374408  93601 44806  374556  93638 44806 28.46 54.33 2688985.00
> 3.6.0-rc5.mikulas.3+-job.fio-run2/output-avg
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0  870892 217723 38528 17.83 41.57 1697870.00
>      read1	1430164 357541 23462       0      0     0 14.41 56.00 241315.00
> randwrite1	      0      0     0  759789 189947 44163 31.48 36.36 1256040.00
>  randread1	1043830 260958 32146       0      0     0 31.89 44.39 185032.00
> readwrite1	 692567 173141 24226  692489 173122 24226 18.65 53.64 311255.00
>    randrw1	 501208 125302 33469  501446 125361 33469 35.40 41.61 246391.00
>                                    %diff
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0       6      6    -6 7.41 -33.02 -46.26
>      read1	     90     90   -47       0      0     0 0.00 -12.07 -92.44
> randwrite1	      0      0     0       0      0     0 5.07 -31.69 -52.02
>  randread1	     42     42   -29       0      0     0 18.02 -21.11 -92.30
> readwrite1	     74     74   -42      74     74   -42 28.62 -15.14 -91.00
>    randrw1	     33     33   -25      33     33   -25 24.39 -23.41 -90.84
> 
> Wow!  Switching to the per-cpu semaphore implementation just boosted the
> performance of the I/O path big-time.  Note that the system time also
> goes down!  So, we get better throughput and less system time.  This
> sounds too good to be true.  ;-)  Here are the standard deviations
> (again, shown as percentages) for the .3+ kernel:
> 
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0       0      0     0 0.96 0.19   1.03
>      read1	      0      0     0       0      0     0 1.82 0.24   2.46
> randwrite1	      0      0     0       0      0     0 0.40 0.39   0.68
>  randread1	      0      0     0       0      0     0 0.53 0.31   2.02
> readwrite1	      0      0     0       0      0     0 2.73 4.07  33.27
>    randrw1	      1      1     1       1      1     1 0.40 0.10   3.29
> 
> Again, there's no slop there, so the results are very reproducible.
> 
> Finally, the last patch changes to an RCU-based rw semaphore
> implementation.  Here are the results for that, as compared with the
> previous kernel:
> 
> 3.6.0-rc5.mikulas.3+-job.fio-run2/output-avg
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0  870892 217723 38528 17.83 41.57 1697870.00
>      read1	1430164 357541 23462       0      0     0 14.41 56.00 241315.00
> randwrite1	      0      0     0  759789 189947 44163 31.48 36.36 1256040.00
>  randread1	1043830 260958 32146       0      0     0 31.89 44.39 185032.00
> readwrite1	 692567 173141 24226  692489 173122 24226 18.65 53.64 311255.00
>    randrw1	 501208 125302 33469  501446 125361 33469 35.40 41.61 246391.00
> 3.6.0-rc5.mikulas.4+-job.fio-run2/output-avg
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0  812659 203164 41309 16.80 61.71 3208620.00
>      read1	 739061 184765 45442       0      0     0 14.32 62.85 3375484.00
> randwrite1	      0      0     0  726971 181742 46192 30.00 52.33 2736270.00
>  randread1	 719040 179760 46683       0      0     0 26.47 54.78 2914080.00
> readwrite1	 396670  99167 42309  396619  99154 42309 14.91 63.12 3412220.00
>    randrw1	 374790  93697 44766  374807  93701 44766 28.42 54.10 2774690.00
>                                    %diff
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0      -6     -6     7 -5.78 48.45  88.98
>      read1	    -48    -48    93       0      0     0 0.00 12.23 1298.79
> randwrite1	      0      0     0       0      0     0 0.00 43.92 117.85
>  randread1	    -31    -31    45       0      0     0 -17.00 23.41 1474.91
> readwrite1	    -42    -42    74     -42    -42    74 -20.05 17.67 996.28
>    randrw1	    -25    -25    33     -25    -25    33 -19.72 30.02 1026.13
> 
> And we've lost a good bit of performance!  Talk about
> counter-intuitive.  Here are the standard deviation numbers:
> 
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0       2      2     2 2.96 3.00   6.79
>      read1	      3      3     3       0      0     0 6.52 2.82  21.86
> randwrite1	      0      0     0       2      2     2 0.71 0.55   4.07
>  randread1	      1      1     1       0      0     0 4.13 2.31  20.12
> readwrite1	      1      1     1       1      1     1 4.14 2.64   6.12
>    randrw1	      0      0     0       0      0     0 0.59 0.25   2.99
> 
> 
> Here is a comparison of the vanilla kernel versus the best performing
> patch in this series (patch 3 of 4):
> 
>       3.6.0-rc5+-job.fio-run2/output-avg
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0  748522 187130 44864 16.34 60.65 3799440.00
>      read1	 690615 172653 48602       0      0     0 13.45 61.42 4044720.00
> randwrite1	      0      0     0  716406 179101 46839 29.03 52.79 3151140.00
>  randread1	 683466 170866 49108       0      0     0 25.92 54.67 3081610.00
> readwrite1	 377518  94379 44450  377645  94410 44450 15.49 64.32 3139240.00
>    randrw1	 355815  88953 47178  355733  88933 47178 27.96 54.24 2944570.00
> 3.6.0-rc5.mikulas.3+-job.fio-run2/output-avg
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0  870892 217723 38528 17.83 41.57 1697870.00
>      read1	1430164 357541 23462       0      0     0 14.41 56.00 241315.00
> randwrite1	      0      0     0  759789 189947 44163 31.48 36.36 1256040.00
>  randread1	1043830 260958 32146       0      0     0 31.89 44.39 185032.00
> readwrite1	 692567 173141 24226  692489 173122 24226 18.65 53.64 311255.00
>    randrw1	 501208 125302 33469  501446 125361 33469 35.40 41.61 246391.00
>                                    %diff
>           	        READ                WRITE                 CPU          
>   Job Name	     BW   IOPS  msec      BW   IOPS  msec   usr  sys   csw
>     write1	      0      0     0      16     16   -14 9.12 -31.46 -55.31
>      read1	    107    107   -51       0      0     0 7.14 -8.82 -94.03
> randwrite1	      0      0     0       6      6    -5 8.44 -31.12 -60.14
>  randread1	     52     52   -34       0      0     0 23.03 -18.80 -94.00
> readwrite1	     83     83   -45      83     83   -45 20.40 -16.60 -90.09
>    randrw1	     40     40   -29      40     40   -29 26.61 -23.29 -91.63
> 
> 
> Next up, I'm going to get some perf and blktrace data from these runs to
> see if I can identify why there is such a drastic change in
> performance.  I will also attempt to run the tests against a different
> vendor's adapter, and maybe against some FC storage if I can set that up.
> 
> Cheers,
> Jeff
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
  2012-09-18 17:04                                 ` Mikulas Patocka
@ 2012-09-18 17:22                                   ` Jeff Moyer
  2012-09-18 18:55                                     ` Mikulas Patocka
  2012-09-18 20:11                                   ` Jeff Moyer
  1 sibling, 1 reply; 45+ messages in thread
From: Jeff Moyer @ 2012-09-18 17:22 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Eric Dumazet, Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

Mikulas Patocka <mpatocka@redhat.com> writes:

> Hi Jeff
>
> Thanks for testing.
>
> It would be interesting to see what happens if you take patch 3, leave 
> "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct 
> block_device", but remove any use of the semaphore from fs/block_dev.c. 
> Will the performance be like the unpatched kernel or like patch 3? It 
> could be that the change in alignment affects performance on your CPU 
> too, just differently than on my CPU.

I'll give it a try and report back.

> What is the CPU model that you used for testing?

http://ark.intel.com/products/53570/Intel-Xeon-Processor-E7-2860-%2824M-Cache-2_26-GHz-6_40-GTs-Intel-QPI%29

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
  2012-09-18 17:22                                   ` Jeff Moyer
@ 2012-09-18 18:55                                     ` Mikulas Patocka
  2012-09-18 18:58                                       ` Jeff Moyer
  0 siblings, 1 reply; 45+ messages in thread
From: Mikulas Patocka @ 2012-09-18 18:55 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Eric Dumazet, Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon



On Tue, 18 Sep 2012, Jeff Moyer wrote:

> Mikulas Patocka <mpatocka@redhat.com> writes:
> 
> > Hi Jeff
> >
> > Thanks for testing.
> >
> > It would be interesting to see what happens if you take patch 3, leave 
> > "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct 
> > block_device", but remove any use of the semaphore from fs/block_dev.c. 
> > Will the performance be like the unpatched kernel or like patch 3? It 
> > could be that the change in alignment affects performance on your CPU 
> > too, just differently than on my CPU.
> 
> I'll give it a try and report back.
> 
> > What is the CPU model that you used for testing?
> 
> http://ark.intel.com/products/53570/Intel-Xeon-Processor-E7-2860-%2824M-Cache-2_26-GHz-6_40-GTs-Intel-QPI%29
> 
> Cheers,
> Jeff

BTW, why did you use just 4 processes? - the processor has 10 cores and 20 
threads (so theoretically, you could run 20 processes bound to a single 
NUMA node). Were the results not stable with more than 4 processes?

Mikulas

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
  2012-09-18 18:55                                     ` Mikulas Patocka
@ 2012-09-18 18:58                                       ` Jeff Moyer
  0 siblings, 0 replies; 45+ messages in thread
From: Jeff Moyer @ 2012-09-18 18:58 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Eric Dumazet, Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

Mikulas Patocka <mpatocka@redhat.com> writes:

> On Tue, 18 Sep 2012, Jeff Moyer wrote:
>
>> Mikulas Patocka <mpatocka@redhat.com> writes:
>> 
>> > Hi Jeff
>> >
>> > Thanks for testing.
>> >
>> > It would be interesting to see what happens if you take patch 3, leave 
>> > "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct 
>> > block_device", but remove any use of the semaphore from fs/block_dev.c. 
>> > Will the performance be like the unpatched kernel or like patch 3? It 
>> > could be that the change in alignment affects performance on your CPU 
>> > too, just differently than on my CPU.
>> 
>> I'll give it a try and report back.
>> 
>> > What is the CPU model that you used for testing?
>> 
>> http://ark.intel.com/products/53570/Intel-Xeon-Processor-E7-2860-%2824M-Cache-2_26-GHz-6_40-GTs-Intel-QPI%29
>> 
> BTW, why did you use just 4 processes? - the processor has 10 cores and 20 
> threads (so theoretically, you could run 20 processes bound to a single 
> NUMA node). Were the results not stable with more than 4 processes?

There is no good reason for it.  Since I was able to show some
differences in performance, I didn't see the need to scale beyond 4.  I
can certainly bump the count up if/when that becomes interesting.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
  2012-09-18 17:04                                 ` Mikulas Patocka
  2012-09-18 17:22                                   ` Jeff Moyer
@ 2012-09-18 20:11                                   ` Jeff Moyer
  2012-09-25 17:49                                     ` Jeff Moyer
  1 sibling, 1 reply; 45+ messages in thread
From: Jeff Moyer @ 2012-09-18 20:11 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Eric Dumazet, Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

Mikulas Patocka <mpatocka@redhat.com> writes:

> Hi Jeff
>
> Thanks for testing.
>
> It would be interesting to see what happens if you take patch 3, leave 
> "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct 
> block_device", but remove any use of the semaphore from fs/block_dev.c. 
> Will the performance be like the unpatched kernel or like patch 3? It 
> could be that the change in alignment affects performance on your CPU 
> too, just differently than on my CPU.

It turns out to be exactly the same performance as with the 3rd patch
applied, so I guess it does have something to do with cache alignment.
Here is the patch (against vanilla) I ended up testing.  Let me know if
I've botched it somehow.

So, next up I'll play similar tricks to what you did (padding struct
block_device in all kernels) to eliminate the differences due to
structure alignment and provide a clear picture of what the locking
effects are.

Thanks!
Jeff


diff --git a/drivers/char/raw.c b/drivers/char/raw.c
index 54a3a6d..0bb207e 100644
--- a/drivers/char/raw.c
+++ b/drivers/char/raw.c
@@ -285,7 +285,7 @@ static long raw_ctl_compat_ioctl(struct file *file, unsigned int cmd,
 
 static const struct file_operations raw_fops = {
 	.read		= do_sync_read,
-	.aio_read	= generic_file_aio_read,
+	.aio_read	= blkdev_aio_read,
 	.write		= do_sync_write,
 	.aio_write	= blkdev_aio_write,
 	.fsync		= blkdev_fsync,
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 38e721b..c7514b5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -116,6 +116,8 @@ EXPORT_SYMBOL(invalidate_bdev);
 
 int set_blocksize(struct block_device *bdev, int size)
 {
+	struct address_space *mapping;
+
 	/* Size must be a power of two, and between 512 and PAGE_SIZE */
 	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
 		return -EINVAL;
@@ -124,6 +126,16 @@ int set_blocksize(struct block_device *bdev, int size)
 	if (size < bdev_logical_block_size(bdev))
 		return -EINVAL;
 
+	/* Check that the block device is not memory mapped */
+	mapping = bdev->bd_inode->i_mapping;
+	mutex_lock(&mapping->i_mmap_mutex);
+	if (!prio_tree_empty(&mapping->i_mmap) ||
+	    !list_empty(&mapping->i_mmap_nonlinear)) {
+		mutex_unlock(&mapping->i_mmap_mutex);
+		return -EBUSY;
+	}
+	mutex_unlock(&mapping->i_mmap_mutex);
+
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
 		sync_blockdev(bdev);
@@ -131,6 +143,7 @@ int set_blocksize(struct block_device *bdev, int size)
 		bdev->bd_inode->i_blkbits = blksize_bits(size);
 		kill_bdev(bdev);
 	}
+
 	return 0;
 }
 
@@ -441,6 +454,12 @@ static struct inode *bdev_alloc_inode(struct super_block *sb)
 	struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL);
 	if (!ei)
 		return NULL;
+
+	if (unlikely(percpu_init_rwsem(&ei->bdev.bd_block_size_semaphore))) {
+		kmem_cache_free(bdev_cachep, ei);
+		return NULL;
+	}
+
 	return &ei->vfs_inode;
 }
 
@@ -449,6 +468,8 @@ static void bdev_i_callback(struct rcu_head *head)
 	struct inode *inode = container_of(head, struct inode, i_rcu);
 	struct bdev_inode *bdi = BDEV_I(inode);
 
+	percpu_free_rwsem(&bdi->bdev.bd_block_size_semaphore);
+
 	kmem_cache_free(bdev_cachep, bdi);
 }
 
@@ -1567,6 +1588,19 @@ static long block_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	return blkdev_ioctl(bdev, mode, cmd, arg);
 }
 
+ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			unsigned long nr_segs, loff_t pos)
+{
+	ssize_t ret;
+	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
+	percpu_rwsem_ptr p;
+
+	ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_aio_read);
+
 /*
  * Write data to the block device.  Only intended for the block device itself
  * and the raw driver which basically is a fake block device.
@@ -1578,6 +1612,7 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
 			 unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
+	struct block_device *bdev = I_BDEV(file->f_mapping->host);
 	struct blk_plug plug;
 	ssize_t ret;
 
@@ -1597,6 +1632,16 @@ ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
 }
 EXPORT_SYMBOL_GPL(blkdev_aio_write);
 
+int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	int ret;
+	struct block_device *bdev = I_BDEV(file->f_mapping->host);
+
+	ret = generic_file_mmap(file, vma);
+
+	return ret;
+}
+
 /*
  * Try to release a page associated with block device when the system
  * is under memory pressure.
@@ -1627,9 +1672,9 @@ const struct file_operations def_blk_fops = {
 	.llseek		= block_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
-  	.aio_read	= generic_file_aio_read,
+  	.aio_read	= blkdev_aio_read,
 	.aio_write	= blkdev_aio_write,
-	.mmap		= generic_file_mmap,
+	.mmap		= blkdev_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
 #ifdef CONFIG_COMPAT
diff --git a/include/linux/fs.h b/include/linux/fs.h
index aa11047..15c481d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -10,6 +10,7 @@
 #include <linux/ioctl.h>
 #include <linux/blk_types.h>
 #include <linux/types.h>
+#include <linux/percpu-rwsem.h>
 
 /*
  * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -724,6 +725,8 @@ struct block_device {
 	int			bd_fsfreeze_count;
 	/* Mutex for freeze */
 	struct mutex		bd_fsfreeze_mutex;
+	/* A semaphore that prevents I/O while block size is being changed */
+	struct percpu_rw_semaphore	bd_block_size_semaphore;
 };
 
 /*
@@ -2564,6 +2567,8 @@ extern int generic_segment_checks(const struct iovec *iov,
 		unsigned long *nr_segs, size_t *count, int access_flags);
 
 /* fs/block_dev.c */
+extern ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			       unsigned long nr_segs, loff_t pos);
 extern ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
 				unsigned long nr_segs, loff_t pos);
 extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end,

^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
  2012-09-18 20:11                                   ` Jeff Moyer
@ 2012-09-25 17:49                                     ` Jeff Moyer
  2012-09-25 17:59                                       ` Jens Axboe
  2012-09-25 22:58                                       ` [PATCH 0/4] " Mikulas Patocka
  0 siblings, 2 replies; 45+ messages in thread
From: Jeff Moyer @ 2012-09-25 17:49 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Eric Dumazet, Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

Jeff Moyer <jmoyer@redhat.com> writes:

> Mikulas Patocka <mpatocka@redhat.com> writes:
>
>> Hi Jeff
>>
>> Thanks for testing.
>>
>> It would be interesting ... what happens if you take the patch 3, leave 
>> "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct 
>> block_device", but remove any use of the semaphore from fs/block_dev.c? - 
>> will the performance be like unpatched kernel or like patch 3? It could be 
>> that the change in the alignment affects performance on your CPU too, just 
>> differently than on my CPU.
>
> It turns out to be exactly the same performance as with the 3rd patch
> applied, so I guess it does have something to do with cache alignment.
> Here is the patch (against vanilla) I ended up testing.  Let me know if
> I've botched it somehow.
>
> So, next up I'll play similar tricks to what you did (padding struct
> block_device in all kernels) to eliminate the differences due to
> structure alignment and provide a clear picture of what the locking
> effects are.

After trying again with the same padding you used in the struct
bdev_inode, I see no performance differences between any of the
patches.  I tried bumping up the number of threads to saturate the
number of cpus on a single NUMA node on my hardware, but that resulted
in lower IOPS to the device, and hence consumption of less CPU time.
So, I believe my results to be inconclusive.

After talking with Vivek about the problem, he had mentioned that it
might be worth investigating whether bd_block_size could be protected
using SRCU.  I looked into it, and the one thing I couldn't reconcile is
updating both the bd_block_size and the inode->i_blkbits at the same
time.  It would involve (afaiui) adding fields to both the inode and the
block_device data structures and using rcu_assign_pointer and
rcu_dereference to modify and access the fields, and both fields would
need to be protected by the same struct srcu_struct.  I'm not sure whether
that's a desirable approach.  When I started to implement it, it got
ugly pretty quickly.  What do others think?
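
To sketch the direction I was heading (hypothetical only - "struct
bdev_size_info" and the "bd_size_info" pointer are made-up names, not
code from any posted patch):

	/* both values live in one RCU-published object, so a reader
	 * always sees a mutually consistent pair */
	struct bdev_size_info {
		unsigned bd_block_size;
		unsigned char i_blkbits;
	};

	/* initialized once with init_srcu_struct() */
	static struct srcu_struct bdev_size_srcu;

	/* reader side: dereference once, use that snapshot throughout */
	int idx = srcu_read_lock(&bdev_size_srcu);
	struct bdev_size_info *info =
		srcu_dereference(bdev->bd_size_info, &bdev_size_srcu);
	/* ... use only info->bd_block_size and info->i_blkbits ... */
	srcu_read_unlock(&bdev_size_srcu, idx);

	/* writer side (set_blocksize): publish a new pair, wait for
	 * readers of the old one to finish, then free it */
	rcu_assign_pointer(bdev->bd_size_info, new_info);
	synchronize_srcu(&bdev_size_srcu);
	kfree(old_info);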

For now, my preference is to get the full patch set in.  I will continue
to investigate the performance impact of the data structure size changes
that I've been seeing.

So, for the four patches:

Acked-by: Jeff Moyer <jmoyer@redhat.com>

Jens, can you have a look at the patch set?  We are seeing problem
reports of this in the wild[1][2].

Cheers,
Jeff

[1] https://bugzilla.redhat.com/show_bug.cgi?id=824107
[2] https://bugzilla.redhat.com/show_bug.cgi?id=812129

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
  2012-09-25 17:49                                     ` Jeff Moyer
@ 2012-09-25 17:59                                       ` Jens Axboe
  2012-09-25 18:11                                         ` Jens Axboe
  2012-09-25 22:58                                       ` [PATCH 0/4] " Mikulas Patocka
  1 sibling, 1 reply; 45+ messages in thread
From: Jens Axboe @ 2012-09-25 17:59 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Mikulas Patocka, Eric Dumazet, Andrea Arcangeli, Jan Kara,
	dm-devel, linux-kernel, Alexander Viro, kosaki.motohiro,
	linux-fsdevel, lwoodman, Alasdair G. Kergon

On 2012-09-25 19:49, Jeff Moyer wrote:
> Jeff Moyer <jmoyer@redhat.com> writes:
> 
>> Mikulas Patocka <mpatocka@redhat.com> writes:
>>
>>> Hi Jeff
>>>
>>> Thanks for testing.
>>>
>>> It would be interesting ... what happens if you take the patch 3, leave 
>>> "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct 
>>> block_device", but remove any use of the semaphore from fs/block_dev.c? - 
>>> will the performance be like unpatched kernel or like patch 3? It could be 
>>> that the change in the alignment affects performance on your CPU too, just 
>>> differently than on my CPU.
>>
>> It turns out to be exactly the same performance as with the 3rd patch
>> applied, so I guess it does have something to do with cache alignment.
>> Here is the patch (against vanilla) I ended up testing.  Let me know if
>> I've botched it somehow.
>>
>> So, next up I'll play similar tricks to what you did (padding struct
>> block_device in all kernels) to eliminate the differences due to
>> structure alignment and provide a clear picture of what the locking
>> effects are.
> 
> After trying again with the same padding you used in the struct
> bdev_inode, I see no performance differences between any of the
> patches.  I tried bumping up the number of threads to saturate the
> number of cpus on a single NUMA node on my hardware, but that resulted
> in lower IOPS to the device, and hence consumption of less CPU time.
> So, I believe my results to be inconclusive.
> 
> After talking with Vivek about the problem, he had mentioned that it
> might be worth investigating whether bd_block_size could be protected
> using SRCU.  I looked into it, and the one thing I couldn't reconcile is
> updating both the bd_block_size and the inode->i_blkbits at the same
> time.  It would involve (afaiui) adding fields to both the inode and the
> block_device data structures and using rcu_assign_pointer  and
> rcu_dereference to modify and access the fields, and both fields would
> need to be protected by the same struct srcu_struct.  I'm not sure whether
> that's a desirable approach.  When I started to implement it, it got
> ugly pretty quickly.  What do others think?
> 
> For now, my preference is to get the full patch set in.  I will continue
> to investigate the performance impact of the data structure size changes
> that I've been seeing.
> 
> So, for the four patches:
> 
> Acked-by: Jeff Moyer <jmoyer@redhat.com>
> 
> Jens, can you have a look at the patch set?  We are seeing problem
> reports of this in the wild[1][2].

I'll queue it up for 3.7. I can run my regular testing on the 8-way, it
has a knack for showing scaling problems very nicely in aio/dio. As long
as we're not adding per-inode cache line dirtying per IO (and the
per-cpu rw sem looks OK), then I don't think there's too much to worry
about.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
  2012-09-25 17:59                                       ` Jens Axboe
@ 2012-09-25 18:11                                         ` Jens Axboe
  2012-09-25 22:49                                           ` [PATCH 1/2] " Mikulas Patocka
  2012-09-25 22:50                                           ` [PATCH 2/2] " Mikulas Patocka
  0 siblings, 2 replies; 45+ messages in thread
From: Jens Axboe @ 2012-09-25 18:11 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Mikulas Patocka, Eric Dumazet, Andrea Arcangeli, Jan Kara,
	dm-devel, linux-kernel, Alexander Viro, kosaki.motohiro,
	linux-fsdevel, lwoodman, Alasdair G. Kergon

On 2012-09-25 19:59, Jens Axboe wrote:
> On 2012-09-25 19:49, Jeff Moyer wrote:
>> Jeff Moyer <jmoyer@redhat.com> writes:
>>
>>> Mikulas Patocka <mpatocka@redhat.com> writes:
>>>
>>>> Hi Jeff
>>>>
>>>> Thanks for testing.
>>>>
>>>> It would be interesting ... what happens if you take the patch 3, leave 
>>>> "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct 
>>>> block_device", but remove any use of the semaphore from fs/block_dev.c? - 
>>>> will the performance be like unpatched kernel or like patch 3? It could be 
>>>> that the change in the alignment affects performance on your CPU too, just 
>>>> differently than on my CPU.
>>>
>>> It turns out to be exactly the same performance as with the 3rd patch
>>> applied, so I guess it does have something to do with cache alignment.
>>> Here is the patch (against vanilla) I ended up testing.  Let me know if
>>> I've botched it somehow.
>>>
>>> So, next up I'll play similar tricks to what you did (padding struct
>>> block_device in all kernels) to eliminate the differences due to
>>> structure alignment and provide a clear picture of what the locking
>>> effects are.
>>
>> After trying again with the same padding you used in the struct
>> bdev_inode, I see no performance differences between any of the
>> patches.  I tried bumping up the number of threads to saturate the
>> number of cpus on a single NUMA node on my hardware, but that resulted
>> in lower IOPS to the device, and hence consumption of less CPU time.
>> So, I believe my results to be inconclusive.
>>
>> After talking with Vivek about the problem, he had mentioned that it
>> might be worth investigating whether bd_block_size could be protected
>> using SRCU.  I looked into it, and the one thing I couldn't reconcile is
>> updating both the bd_block_size and the inode->i_blkbits at the same
>> time.  It would involve (afaiui) adding fields to both the inode and the
>> block_device data structures and using rcu_assign_pointer  and
>> rcu_dereference to modify and access the fields, and both fields would
>> need to be protected by the same struct srcu_struct.  I'm not sure whether
>> that's a desirable approach.  When I started to implement it, it got
>> ugly pretty quickly.  What do others think?
>>
>> For now, my preference is to get the full patch set in.  I will continue
>> to investigate the performance impact of the data structure size changes
>> that I've been seeing.
>>
>> So, for the four patches:
>>
>> Acked-by: Jeff Moyer <jmoyer@redhat.com>
>>
>> Jens, can you have a look at the patch set?  We are seeing problem
>> reports of this in the wild[1][2].
> 
> I'll queue it up for 3.7. I can run my regular testing on the 8-way, it
> has a knack for showing scaling problems very nicely in aio/dio. As long
> as we're not adding per-inode cache line dirtying per IO (and the
> per-cpu rw sem looks OK), then I don't think there's too much to worry
> about.

I take that back. The series doesn't apply to my current tree. Not too
unexpected, since it's some weeks old. But more importantly, please send
this as a "real" patch series. I don't want to see two implementations
of rw semaphores. I think it's perfectly fine to first do a regular rw
sem, then a last patch adding the cache friendly variant from Eric and
converting to that.

In other words, get rid of 3/4.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 1/2] Fix a crash when block device is read and block size is changed at the same time
  2012-09-25 18:11                                         ` Jens Axboe
@ 2012-09-25 22:49                                           ` Mikulas Patocka
  2012-09-26  5:48                                             ` Jens Axboe
  2012-11-16 22:02                                             ` Jeff Moyer
  2012-09-25 22:50                                           ` [PATCH 2/2] " Mikulas Patocka
  1 sibling, 2 replies; 45+ messages in thread
From: Mikulas Patocka @ 2012-09-25 22:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jeff Moyer, Eric Dumazet, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon



On Tue, 25 Sep 2012, Jens Axboe wrote:

> On 2012-09-25 19:59, Jens Axboe wrote:
> > On 2012-09-25 19:49, Jeff Moyer wrote:
> >> Jeff Moyer <jmoyer@redhat.com> writes:
> >>
> >>> Mikulas Patocka <mpatocka@redhat.com> writes:
> >>>
> >>>> Hi Jeff
> >>>>
> >>>> Thanks for testing.
> >>>>
> >>>> It would be interesting ... what happens if you take the patch 3, leave 
> >>>> "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct 
> >>>> block_device", but remove any use of the semaphore from fs/block_dev.c? - 
> >>>> will the performance be like unpatched kernel or like patch 3? It could be 
> >>>> that the change in the alignment affects performance on your CPU too, just 
> >>>> differently than on my CPU.
> >>>
> >>> It turns out to be exactly the same performance as with the 3rd patch
> >>> applied, so I guess it does have something to do with cache alignment.
> >>> Here is the patch (against vanilla) I ended up testing.  Let me know if
> >>> I've botched it somehow.
> >>>
> >>> So, next up I'll play similar tricks to what you did (padding struct
> >>> block_device in all kernels) to eliminate the differences due to
> >>> structure alignment and provide a clear picture of what the locking
> >>> effects are.
> >>
> >> After trying again with the same padding you used in the struct
> >> bdev_inode, I see no performance differences between any of the
> >> patches.  I tried bumping up the number of threads to saturate the
> >> number of cpus on a single NUMA node on my hardware, but that resulted
> >> in lower IOPS to the device, and hence consumption of less CPU time.
> >> So, I believe my results to be inconclusive.
> >>
> >> After talking with Vivek about the problem, he had mentioned that it
> >> might be worth investigating whether bd_block_size could be protected
> >> using SRCU.  I looked into it, and the one thing I couldn't reconcile is
> >> updating both the bd_block_size and the inode->i_blkbits at the same
> >> time.  It would involve (afaiui) adding fields to both the inode and the
> >> block_device data structures and using rcu_assign_pointer  and
> >> rcu_dereference to modify and access the fields, and both fields would
> >> need to be protected by the same struct srcu_struct.  I'm not sure whether
> >> that's a desirable approach.  When I started to implement it, it got
> >> ugly pretty quickly.  What do others think?
> >>
> >> For now, my preference is to get the full patch set in.  I will continue
> >> to investigate the performance impact of the data structure size changes
> >> that I've been seeing.
> >>
> >> So, for the four patches:
> >>
> >> Acked-by: Jeff Moyer <jmoyer@redhat.com>
> >>
> >> Jens, can you have a look at the patch set?  We are seeing problem
> >> reports of this in the wild[1][2].
> > 
> > I'll queue it up for 3.7. I can run my regular testing on the 8-way, it
> > has a knack for showing scaling problems very nicely in aio/dio. As long
> > as we're not adding per-inode cache line dirtying per IO (and the
> > per-cpu rw sem looks OK), then I don't think there's too much to worry
> > about.
> 
> I take that back. The series doesn't apply to my current tree. Not too
> unexpected, since it's some weeks old. But more importantly, please send
> this as a "real" patch series. I don't want to see two implementations
> of rw semaphores. I think it's perfectly fine to first do a regular rw
> sem, then a last patch adding the cache friendly variant from Eric and
> converting to that.
> 
> In other words, get rid of 3/4.
> 
> -- 
> Jens Axboe

Hi Jens

Here I'm resending it as two patches. The first one uses an existing
rw semaphore, the second converts it to an RCU-based percpu semaphore.

Mikulas

---

blockdev: fix a crash when block size is changed and I/O is issued simultaneously

The kernel may crash when block size is changed and I/O is issued
simultaneously.

Because some subsystems (udev or lvm) may read any block device anytime,
the bug actually puts any code that changes a block device's block size in
jeopardy.

The crash can be reproduced if you place "msleep(1000)" in
blkdev_get_blocks just before "bh->b_size = max_blocks <<
inode->i_blkbits;".
Then run "dd if=/dev/ram0 of=/dev/null bs=4k count=1 iflag=direct".
While it is waiting in msleep, run "blockdev --setbsz 2048 /dev/ram0".
You get a BUG.

The direct and non-direct I/O code is written with the assumption that
the block size does not change. It doesn't seem practical to fix these
crashes one by one: there may be many possible crashes when the block
size changes at a particular place, and it is impossible to find them
all and verify the code.

This patch introduces a new rw-lock bd_block_size_semaphore. The lock is
taken for read during I/O. It is taken for write when changing block
size. Consequently, block size can't be changed while I/O is being
submitted.

For asynchronous I/O, the patch only prevents a block size change while
the I/O is being submitted. The block size can change while the I/O is
in progress or being completed. This is acceptable because the block
size is not accessed while asynchronous I/O is being completed.

The patch prevents the block size from changing while the device is
mapped with mmap.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
---
 drivers/char/raw.c |    2 -
 fs/block_dev.c     |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/fs.h |    4 +++
 3 files changed, 65 insertions(+), 3 deletions(-)

Index: linux-2.6-copy/include/linux/fs.h
===================================================================
--- linux-2.6-copy.orig/include/linux/fs.h	2012-09-03 15:55:47.000000000 +0200
+++ linux-2.6-copy/include/linux/fs.h	2012-09-26 00:41:07.000000000 +0200
@@ -724,6 +724,8 @@ struct block_device {
 	int			bd_fsfreeze_count;
 	/* Mutex for freeze */
 	struct mutex		bd_fsfreeze_mutex;
+	/* A semaphore that prevents I/O while block size is being changed */
+	struct rw_semaphore	bd_block_size_semaphore;
 };
 
 /*
@@ -2564,6 +2566,8 @@ extern int generic_segment_checks(const 
 		unsigned long *nr_segs, size_t *count, int access_flags);
 
 /* fs/block_dev.c */
+extern ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			       unsigned long nr_segs, loff_t pos);
 extern ssize_t blkdev_aio_write(struct kiocb *iocb, const struct iovec *iov,
 				unsigned long nr_segs, loff_t pos);
 extern int blkdev_fsync(struct file *filp, loff_t start, loff_t end,
Index: linux-2.6-copy/fs/block_dev.c
===================================================================
--- linux-2.6-copy.orig/fs/block_dev.c	2012-09-03 15:55:44.000000000 +0200
+++ linux-2.6-copy/fs/block_dev.c	2012-09-26 00:42:49.000000000 +0200
@@ -116,6 +116,8 @@ EXPORT_SYMBOL(invalidate_bdev);
 
 int set_blocksize(struct block_device *bdev, int size)
 {
+	struct address_space *mapping;
+
 	/* Size must be a power of two, and between 512 and PAGE_SIZE */
 	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
 		return -EINVAL;
@@ -124,6 +126,20 @@ int set_blocksize(struct block_device *b
 	if (size < bdev_logical_block_size(bdev))
 		return -EINVAL;
 
+	/* Prevent starting I/O or mapping the device */
+	down_write(&bdev->bd_block_size_semaphore);
+
+	/* Check that the block device is not memory mapped */
+	mapping = bdev->bd_inode->i_mapping;
+	mutex_lock(&mapping->i_mmap_mutex);
+	if (!prio_tree_empty(&mapping->i_mmap) ||
+	    !list_empty(&mapping->i_mmap_nonlinear)) {
+		mutex_unlock(&mapping->i_mmap_mutex);
+		up_write(&bdev->bd_block_size_semaphore);
+		return -EBUSY;
+	}
+	mutex_unlock(&mapping->i_mmap_mutex);
+
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
 		sync_blockdev(bdev);
@@ -131,6 +147,9 @@ int set_blocksize(struct block_device *b
 		bdev->bd_inode->i_blkbits = blksize_bits(size);
 		kill_bdev(bdev);
 	}
+
+	up_write(&bdev->bd_block_size_semaphore);
+
 	return 0;
 }
 
@@ -472,6 +491,7 @@ static void init_once(void *foo)
 	inode_init_once(&ei->vfs_inode);
 	/* Initialize mutex for freeze. */
 	mutex_init(&bdev->bd_fsfreeze_mutex);
+	init_rwsem(&bdev->bd_block_size_semaphore);
 }
 
 static inline void __bd_forget(struct inode *inode)
@@ -1567,6 +1587,22 @@ static long block_ioctl(struct file *fil
 	return blkdev_ioctl(bdev, mode, cmd, arg);
 }
 
+ssize_t blkdev_aio_read(struct kiocb *iocb, const struct iovec *iov,
+			unsigned long nr_segs, loff_t pos)
+{
+	ssize_t ret;
+	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
+
+	down_read(&bdev->bd_block_size_semaphore);
+
+	ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
+
+	up_read(&bdev->bd_block_size_semaphore);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blkdev_aio_read);
+
 /*
  * Write data to the block device.  Only intended for the block device itself
  * and the raw driver which basically is a fake block device.
@@ -1578,12 +1614,16 @@ ssize_t blkdev_aio_write(struct kiocb *i
 			 unsigned long nr_segs, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
+	struct block_device *bdev = I_BDEV(file->f_mapping->host);
 	struct blk_plug plug;
 	ssize_t ret;
 
 	BUG_ON(iocb->ki_pos != pos);
 
 	blk_start_plug(&plug);
+
+	down_read(&bdev->bd_block_size_semaphore);
+
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	if (ret > 0 || ret == -EIOCBQUEUED) {
 		ssize_t err;
@@ -1592,11 +1632,29 @@ ssize_t blkdev_aio_write(struct kiocb *i
 		if (err < 0 && ret > 0)
 			ret = err;
 	}
+
+	up_read(&bdev->bd_block_size_semaphore);
+
 	blk_finish_plug(&plug);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(blkdev_aio_write);
 
+int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	int ret;
+	struct block_device *bdev = I_BDEV(file->f_mapping->host);
+
+	down_read(&bdev->bd_block_size_semaphore);
+
+	ret = generic_file_mmap(file, vma);
+
+	up_read(&bdev->bd_block_size_semaphore);
+
+	return ret;
+}
+
 /*
  * Try to release a page associated with block device when the system
  * is under memory pressure.
@@ -1627,9 +1685,9 @@ const struct file_operations def_blk_fop
 	.llseek		= block_llseek,
 	.read		= do_sync_read,
 	.write		= do_sync_write,
-  	.aio_read	= generic_file_aio_read,
+  	.aio_read	= blkdev_aio_read,
 	.aio_write	= blkdev_aio_write,
-	.mmap		= generic_file_mmap,
+	.mmap		= blkdev_mmap,
 	.fsync		= blkdev_fsync,
 	.unlocked_ioctl	= block_ioctl,
 #ifdef CONFIG_COMPAT
Index: linux-2.6-copy/drivers/char/raw.c
===================================================================
--- linux-2.6-copy.orig/drivers/char/raw.c	2012-09-01 00:14:45.000000000 +0200
+++ linux-2.6-copy/drivers/char/raw.c	2012-09-26 00:41:07.000000000 +0200
@@ -285,7 +285,7 @@ static long raw_ctl_compat_ioctl(struct 
 
 static const struct file_operations raw_fops = {
 	.read		= do_sync_read,
-	.aio_read	= generic_file_aio_read,
+	.aio_read	= blkdev_aio_read,
 	.write		= do_sync_write,
 	.aio_write	= blkdev_aio_write,
 	.fsync		= blkdev_fsync,

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 2/2] Fix a crash when block device is read and block size is changed at the same time
  2012-09-25 18:11                                         ` Jens Axboe
  2012-09-25 22:49                                           ` [PATCH 1/2] " Mikulas Patocka
@ 2012-09-25 22:50                                           ` Mikulas Patocka
  1 sibling, 0 replies; 45+ messages in thread
From: Mikulas Patocka @ 2012-09-25 22:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jeff Moyer, Eric Dumazet, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

blockdev: turn a rw semaphore into a percpu rw semaphore

This avoids cache line bouncing when many processes lock the semaphore
for read.

New percpu lock implementation

The lock consists of an array of percpu unsigned integers, a boolean
variable and a mutex.

When we take the lock for read, we enter an RCU read-side section and
check the "locked" variable. If it is false, we increase a percpu
counter on the current cpu and exit the RCU section. If "locked" is
true, we exit the RCU section, take the mutex and drop it (this waits
until a writer has finished) and retry.

Unlocking for read just decreases the percpu counter. Note that we can
unlock on a different cpu than the one where we locked; in this case the
counter underflows. For example, locking on cpu 0 and unlocking on cpu 1
leaves cpu 0's counter at 1 and cpu 1's counter wrapped to 0xffffffff,
yet their unsigned sum is still 0. The sum of all percpu counters
represents the number of processes that hold the lock for read.

When we need to lock for write, we take the mutex, set the "locked"
variable to true and synchronize RCU. Once RCU has been synchronized,
no process can take a new read lock on the fast path (new readers see
"locked" and fall back to the mutex). We wait until the sum of the
percpu counters is zero - when it is, there are no readers left in the
critical section.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 Documentation/percpu-rw-semaphore.txt |   27 ++++++++++
 fs/block_dev.c                        |   27 ++++++----
 include/linux/fs.h                    |    3 -
 include/linux/percpu-rwsem.h          |   89 ++++++++++++++++++++++++++++++++++
 4 files changed, 135 insertions(+), 11 deletions(-)

---

Index: linux-2.6-copy/fs/block_dev.c
===================================================================
--- linux-2.6-copy.orig/fs/block_dev.c	2012-09-26 00:42:49.000000000 +0200
+++ linux-2.6-copy/fs/block_dev.c	2012-09-26 00:45:29.000000000 +0200
@@ -127,7 +127,7 @@ int set_blocksize(struct block_device *b
 		return -EINVAL;
 
 	/* Prevent starting I/O or mapping the device */
-	down_write(&bdev->bd_block_size_semaphore);
+	percpu_down_write(&bdev->bd_block_size_semaphore);
 
 	/* Check that the block device is not memory mapped */
 	mapping = bdev->bd_inode->i_mapping;
@@ -135,7 +135,7 @@ int set_blocksize(struct block_device *b
 	if (!prio_tree_empty(&mapping->i_mmap) ||
 	    !list_empty(&mapping->i_mmap_nonlinear)) {
 		mutex_unlock(&mapping->i_mmap_mutex);
-		up_write(&bdev->bd_block_size_semaphore);
+		percpu_up_write(&bdev->bd_block_size_semaphore);
 		return -EBUSY;
 	}
 	mutex_unlock(&mapping->i_mmap_mutex);
@@ -148,7 +148,7 @@ int set_blocksize(struct block_device *b
 		kill_bdev(bdev);
 	}
 
-	up_write(&bdev->bd_block_size_semaphore);
+	percpu_up_write(&bdev->bd_block_size_semaphore);
 
 	return 0;
 }
@@ -460,6 +460,12 @@ static struct inode *bdev_alloc_inode(st
 	struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL);
 	if (!ei)
 		return NULL;
+
+	if (unlikely(percpu_init_rwsem(&ei->bdev.bd_block_size_semaphore))) {
+		kmem_cache_free(bdev_cachep, ei);
+		return NULL;
+	}
+
 	return &ei->vfs_inode;
 }
 
@@ -468,6 +474,8 @@ static void bdev_i_callback(struct rcu_h
 	struct inode *inode = container_of(head, struct inode, i_rcu);
 	struct bdev_inode *bdi = BDEV_I(inode);
 
+	percpu_free_rwsem(&bdi->bdev.bd_block_size_semaphore);
+
 	kmem_cache_free(bdev_cachep, bdi);
 }
 
@@ -491,7 +499,6 @@ static void init_once(void *foo)
 	inode_init_once(&ei->vfs_inode);
 	/* Initialize mutex for freeze. */
 	mutex_init(&bdev->bd_fsfreeze_mutex);
-	init_rwsem(&bdev->bd_block_size_semaphore);
 }
 
 static inline void __bd_forget(struct inode *inode)
@@ -1593,11 +1600,11 @@ ssize_t blkdev_aio_read(struct kiocb *io
 	ssize_t ret;
 	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
 
-	down_read(&bdev->bd_block_size_semaphore);
+	percpu_down_read(&bdev->bd_block_size_semaphore);
 
 	ret = generic_file_aio_read(iocb, iov, nr_segs, pos);
 
-	up_read(&bdev->bd_block_size_semaphore);
+	percpu_up_read(&bdev->bd_block_size_semaphore);
 
 	return ret;
 }
@@ -1622,7 +1629,7 @@ ssize_t blkdev_aio_write(struct kiocb *i
 
 	blk_start_plug(&plug);
 
-	down_read(&bdev->bd_block_size_semaphore);
+	percpu_down_read(&bdev->bd_block_size_semaphore);
 
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
 	if (ret > 0 || ret == -EIOCBQUEUED) {
@@ -1633,7 +1640,7 @@ ssize_t blkdev_aio_write(struct kiocb *i
 			ret = err;
 	}
 
-	up_read(&bdev->bd_block_size_semaphore);
+	percpu_up_read(&bdev->bd_block_size_semaphore);
 
 	blk_finish_plug(&plug);
 
@@ -1646,11 +1653,11 @@ int blkdev_mmap(struct file *file, struc
 	int ret;
 	struct block_device *bdev = I_BDEV(file->f_mapping->host);
 
-	down_read(&bdev->bd_block_size_semaphore);
+	percpu_down_read(&bdev->bd_block_size_semaphore);
 
 	ret = generic_file_mmap(file, vma);
 
-	up_read(&bdev->bd_block_size_semaphore);
+	percpu_up_read(&bdev->bd_block_size_semaphore);
 
 	return ret;
 }
Index: linux-2.6-copy/include/linux/fs.h
===================================================================
--- linux-2.6-copy.orig/include/linux/fs.h	2012-09-26 00:41:07.000000000 +0200
+++ linux-2.6-copy/include/linux/fs.h	2012-09-26 00:45:29.000000000 +0200
@@ -10,6 +10,7 @@
 #include <linux/ioctl.h>
 #include <linux/blk_types.h>
 #include <linux/types.h>
+#include <linux/percpu-rwsem.h>
 
 /*
  * It's silly to have NR_OPEN bigger than NR_FILE, but you can change
@@ -725,7 +726,7 @@ struct block_device {
 	/* Mutex for freeze */
 	struct mutex		bd_fsfreeze_mutex;
 	/* A semaphore that prevents I/O while block size is being changed */
-	struct rw_semaphore	bd_block_size_semaphore;
+	struct percpu_rw_semaphore	bd_block_size_semaphore;
 };
 
 /*
Index: linux-2.6-copy/include/linux/percpu-rwsem.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-copy/include/linux/percpu-rwsem.h	2012-09-26 00:45:29.000000000 +0200
@@ -0,0 +1,89 @@
+#ifndef _LINUX_PERCPU_RWSEM_H
+#define _LINUX_PERCPU_RWSEM_H
+
+#include <linux/mutex.h>
+#include <linux/percpu.h>
+#include <linux/rcupdate.h>
+#include <linux/delay.h>
+
+struct percpu_rw_semaphore {
+	unsigned __percpu *counters;
+	bool locked;
+	struct mutex mtx;
+};
+
+static inline void percpu_down_read(struct percpu_rw_semaphore *p)
+{
+	rcu_read_lock();
+	if (unlikely(p->locked)) {
+		rcu_read_unlock();
+		mutex_lock(&p->mtx);
+		this_cpu_inc(*p->counters);
+		mutex_unlock(&p->mtx);
+		return;
+	}
+	this_cpu_inc(*p->counters);
+	rcu_read_unlock();
+}
+
+static inline void percpu_up_read(struct percpu_rw_semaphore *p)
+{
+	/*
+	 * On X86, write operation in this_cpu_dec serves as a memory unlock
+	 * barrier (i.e. memory accesses may be moved before the write, but
+	 * no memory accesses are moved past the write).
+	 * On other architectures this may not be the case, so we need smp_mb()
+	 * there.
+	 */
+#if defined(CONFIG_X86) && (!defined(CONFIG_X86_PPRO_FENCE) && !defined(CONFIG_X86_OOSTORE))
+	barrier();
+#else
+	smp_mb();
+#endif
+	this_cpu_dec(*p->counters);
+}
+
+static inline unsigned __percpu_count(unsigned __percpu *counters)
+{
+	unsigned total = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
+
+	return total;
+}
+
+static inline void percpu_down_write(struct percpu_rw_semaphore *p)
+{
+	mutex_lock(&p->mtx);
+	p->locked = true;
+	synchronize_rcu();
+	while (__percpu_count(p->counters))
+		msleep(1);
+	smp_rmb(); /* paired with smp_mb() in percpu_up_read() */
+}
+
+static inline void percpu_up_write(struct percpu_rw_semaphore *p)
+{
+	p->locked = false;
+	mutex_unlock(&p->mtx);
+}
+
+static inline int percpu_init_rwsem(struct percpu_rw_semaphore *p)
+{
+	p->counters = alloc_percpu(unsigned);
+	if (unlikely(!p->counters))
+		return -ENOMEM;
+	p->locked = false;
+	mutex_init(&p->mtx);
+	return 0;
+}
+
+static inline void percpu_free_rwsem(struct percpu_rw_semaphore *p)
+{
+	free_percpu(p->counters);
+	p->counters = NULL; /* catch use after free bugs */
+}
+
+#endif
Index: linux-2.6-copy/Documentation/percpu-rw-semaphore.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-copy/Documentation/percpu-rw-semaphore.txt	2012-09-26 00:45:29.000000000 +0200
@@ -0,0 +1,27 @@
+Percpu rw semaphores
+--------------------
+
+Percpu rw semaphores are a new read-write semaphore design that is
+optimized for locking for reading.
+
+The problem with traditional read-write semaphores is that when multiple
+cores take the lock for reading, the cache line containing the semaphore
+is bouncing between L1 caches of the cores, causing performance
+degradation.
+
+Locking for reading is very fast: it uses RCU and avoids any atomic
+instruction in the lock and unlock paths. On the other hand, locking
+for writing is very expensive: it calls synchronize_rcu(), which can
+take hundreds of microseconds.
+
+The lock is declared with the "struct percpu_rw_semaphore" type.
+The lock is initialized with percpu_init_rwsem; it returns 0 on success
+and -ENOMEM on allocation failure.
+The lock must be freed with percpu_free_rwsem to avoid a memory leak.
+
+The lock is taken for read with percpu_down_read and released with
+percpu_up_read; for write, use percpu_down_write and percpu_up_write.
+
+The idea of using RCU for an optimized rw-lock was introduced by
+Eric Dumazet <eric.dumazet@gmail.com>.
+The code was written by Mikulas Patocka <mpatocka@redhat.com>
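
For illustration, a minimal usage sketch of the above interface (the
caller and "frob_sem" are hypothetical):

	static struct percpu_rw_semaphore frob_sem;

	/* setup; returns -ENOMEM if the percpu allocation fails */
	if (percpu_init_rwsem(&frob_sem))
		return -ENOMEM;

	/* read side - fast, may be entered on many cpus concurrently */
	percpu_down_read(&frob_sem);
	/* ... read the state protected by the lock ... */
	percpu_up_read(&frob_sem);

	/* write side - slow, excludes all readers while held */
	percpu_down_write(&frob_sem);
	/* ... modify the protected state ... */
	percpu_up_write(&frob_sem);

	/* teardown */
	percpu_free_rwsem(&frob_sem);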


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
  2012-09-25 17:49                                     ` Jeff Moyer
  2012-09-25 17:59                                       ` Jens Axboe
@ 2012-09-25 22:58                                       ` Mikulas Patocka
  2012-09-26 13:47                                         ` Jeff Moyer
  1 sibling, 1 reply; 45+ messages in thread
From: Mikulas Patocka @ 2012-09-25 22:58 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Eric Dumazet, Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon



On Tue, 25 Sep 2012, Jeff Moyer wrote:

> Jeff Moyer <jmoyer@redhat.com> writes:
> 
> > Mikulas Patocka <mpatocka@redhat.com> writes:
> >
> >> Hi Jeff
> >>
> >> Thanks for testing.
> >>
> >> It would be interesting ... what happens if you take the patch 3, leave 
> >> "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct 
> >> block_device", but remove any use of the semaphore from fs/block_dev.c? - 
> >> will the performance be like unpatched kernel or like patch 3? It could be 
> >> that the change in the alignment affects performance on your CPU too, just 
> >> differently than on my CPU.
> >
> > It turns out to be exactly the same performance as with the 3rd patch
> > applied, so I guess it does have something to do with cache alignment.
> > Here is the patch (against vanilla) I ended up testing.  Let me know if
> > I've botched it somehow.
> >
> > So, next up I'll play similar tricks to what you did (padding struct
> > block_device in all kernels) to eliminate the differences due to
> > structure alignment and provide a clear picture of what the locking
> > effects are.
> 
> After trying again with the same padding you used in the struct
> bdev_inode, I see no performance differences between any of the
> patches.  I tried bumping up the number of threads to saturate the
> number of cpus on a single NUMA node on my hardware, but that resulted
> in lower IOPS to the device, and hence consumption of less CPU time.
> So, I believe my results to be inconclusive.

For me, the fourth patch with RCU-based locks performed better, so I am 
submitting that.

> After talking with Vivek about the problem, he had mentioned that it
> might be worth investigating whether bd_block_size could be protected
> using SRCU.  I looked into it, and the one thing I couldn't reconcile is
> updating both the bd_block_size and the inode->i_blkbits at the same
> time.  It would involve (afaiui) adding fields to both the inode and the
> block_device data structures and using rcu_assign_pointer  and
> rcu_dereference to modify and access the fields, and both fields would
> need to be protected by the same struct srcu_struct.  I'm not sure whether
> that's a desirable approach.  When I started to implement it, it got
> ugly pretty quickly.  What do others think?

Using RCU doesn't seem sensible to me (except for lock implementation, as 
it is in patch 4). The major problem is that the block layer reads 
blocksize multiple times and when different values are read, a crash may 
happen - RCU doesn't protect you against that - if you read a variable 
multiple times in a RCU-protected section, you can still get different 
results.
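
For example, even inside an RCU read-side section (illustrative code,
not taken from the block layer):

	rcu_read_lock();
	bits = inode->i_blkbits;	/* reads 12 (4k blocks) */
	/* set_blocksize() can still run here and set i_blkbits to 11 */
	size = 1 << inode->i_blkbits;	/* computes 2048, not 4096 */
	rcu_read_unlock();
	/* "bits" and "size" now describe two different block sizes */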

If we wanted to use RCU, we would have to read blocksize just once and 
pass the value between all functions involved - that would result in a 
massive code change.

> For now, my preference is to get the full patch set in.  I will continue
> to investigate the performance impact of the data structure size changes
> that I've been seeing.

Yes, we should get the patches into the kernel.

Mikulas

> So, for the four patches:
> 
> Acked-by: Jeff Moyer <jmoyer@redhat.com>
> 
> Jens, can you have a look at the patch set?  We are seeing problem
> reports of this in the wild[1][2].
> 
> Cheers,
> Jeff
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=824107
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=812129
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/2] Fix a crash when block device is read and block size is changed at the same time
  2012-09-25 22:49                                           ` [PATCH 1/2] " Mikulas Patocka
@ 2012-09-26  5:48                                             ` Jens Axboe
  2012-11-16 22:02                                             ` Jeff Moyer
  1 sibling, 0 replies; 45+ messages in thread
From: Jens Axboe @ 2012-09-26  5:48 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jeff Moyer, Eric Dumazet, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

On 2012-09-26 00:49, Mikulas Patocka wrote:
> Here I'm resending it as two patches. The first one uses an existing
> rw semaphore, the second converts it to an RCU-based percpu semaphore.

Thanks, applied. In the future, please send new patch 'series' as a new
thread instead of replying to some email in the middle of an existing
thread. It all becomes very messy pretty quickly. Patch #2 had a botched
subject line, so that didn't help either :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
  2012-09-25 22:58                                       ` [PATCH 0/4] " Mikulas Patocka
@ 2012-09-26 13:47                                         ` Jeff Moyer
  2012-09-26 14:35                                           ` Mikulas Patocka
  0 siblings, 1 reply; 45+ messages in thread
From: Jeff Moyer @ 2012-09-26 13:47 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Eric Dumazet, Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

Mikulas Patocka <mpatocka@redhat.com> writes:

> On Tue, 25 Sep 2012, Jeff Moyer wrote:
>
>> Jeff Moyer <jmoyer@redhat.com> writes:
>> 
>> > Mikulas Patocka <mpatocka@redhat.com> writes:
>> >
>> >> Hi Jeff
>> >>
>> >> Thanks for testing.
>> >>
>> >> It would be interesting ... what happens if you take the patch 3, leave 
>> >> "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct 
>> >> block_device", but remove any use of the semaphore from fs/block_dev.c? - 
>> >> will the performance be like unpatched kernel or like patch 3? It could be 
>> >> that the change in the alignment affects performance on your CPU too, just 
>> >> differently than on my CPU.
>> >
>> > It turns out to be exactly the same performance as with the 3rd patch
>> > applied, so I guess it does have something to do with cache alignment.
>> > Here is the patch (against vanilla) I ended up testing.  Let me know if
>> > I've botched it somehow.
>> >
>> > So, next up I'll play similar tricks to what you did (padding struct
>> > block_device in all kernels) to eliminate the differences due to
>> > structure alignment and provide a clear picture of what the locking
>> > effects are.
>> 
>> After trying again with the same padding you used in the struct
>> bdev_inode, I see no performance differences between any of the
>> patches.  I tried bumping up the number of threads to saturate the
>> number of cpus on a single NUMA node on my hardware, but that resulted
>> in lower IOPS to the device, and hence consumption of less CPU time.
>> So, I believe my results to be inconclusive.
>
> For me, the fourth patch with RCU-based locks performed better, so I am 
> submitting that.
>
>> After talking with Vivek about the problem, he had mentioned that it
>> might be worth investigating whether bd_block_size could be protected
>> using SRCU.  I looked into it, and the one thing I couldn't reconcile is
>> updating both the bd_block_size and the inode->i_blkbits at the same
>> time.  It would involve (afaiui) adding fields to both the inode and the
>> block_device data structures and using rcu_assign_pointer  and
>> rcu_dereference to modify and access the fields, and both fields would
>> need to be protected by the same struct srcu_struct.  I'm not sure whether
>> that's a desirable approach.  When I started to implement it, it got
>> ugly pretty quickly.  What do others think?
>
> Using RCU doesn't seem sensible to me (except for lock implementation, as 
> it is in patch 4). The major problem is that the block layer reads 
> blocksize multiple times and when different values are read, a crash may 
> happen - RCU doesn't protect you against that - if you read a variable 
> multiple times in a RCU-protected section, you can still get different 
> results.

SRCU is sleepable, so it could be (I think) used in the same manner as your
rw semaphore.  The only difference is that it would require changing the
bd_blocksize and the i_blkbits to pointers and protecting them both with
the same srcu struct.  Then, the inode i_blkbits would also need to be
special cased, so that we only require such handling when it is
associated with a block device.  It got messy.

> If we wanted to use RCU, we would have to read blocksize just once and 
> pass the value between all functions involved - that would result in a 
> massive code change.

If we did that, we wouldn't need rcu at all, would we?

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time
  2012-09-26 13:47                                         ` Jeff Moyer
@ 2012-09-26 14:35                                           ` Mikulas Patocka
  0 siblings, 0 replies; 45+ messages in thread
From: Mikulas Patocka @ 2012-09-26 14:35 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: Eric Dumazet, Jens Axboe, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon



On Wed, 26 Sep 2012, Jeff Moyer wrote:

> Mikulas Patocka <mpatocka@redhat.com> writes:
> 
> > On Tue, 25 Sep 2012, Jeff Moyer wrote:
> >
> >> Jeff Moyer <jmoyer@redhat.com> writes:
> >> 
> >> > Mikulas Patocka <mpatocka@redhat.com> writes:
> >> >
> >> >> Hi Jeff
> >> >>
> >> >> Thanks for testing.
> >> >>
> >> >> It would be interesting ... what happens if you take the patch 3, leave 
> >> >> "struct percpu_rw_semaphore bd_block_size_semaphore" in "struct 
> >> >> block_device", but remove any use of the semaphore from fs/block_dev.c? - 
> >> >> will the performance be like unpatched kernel or like patch 3? It could be 
> >> >> that the change in the alignment affects performance on your CPU too, just 
> >> >> differently than on my CPU.
> >> >
> >> > It turns out to be exactly the same performance as with the 3rd patch
> >> > applied, so I guess it does have something to do with cache alignment.
> >> > Here is the patch (against vanilla) I ended up testing.  Let me know if
> >> > I've botched it somehow.
> >> >
> >> > So, next up I'll play similar tricks to what you did (padding struct
> >> > block_device in all kernels) to eliminate the differences due to
> >> > structure alignment and provide a clear picture of what the locking
> >> > effects are.
> >> 
> >> After trying again with the same padding you used in the struct
> >> bdev_inode, I see no performance differences between any of the
> >> patches.  I tried bumping up the number of threads to saturate the
> >> number of cpus on a single NUMA node on my hardware, but that resulted
> >> in lower IOPS to the device, and hence consumption of less CPU time.
> >> So, I believe my results to be inconclusive.
> >
> > For me, the fourth patch with RCU-based locks performed better, so I am 
> > submitting that.
> >
> >> After talking with Vivek about the problem, he had mentioned that it
> >> might be worth investigating whether bd_block_size could be protected
> >> using SRCU.  I looked into it, and the one thing I couldn't reconcile is
> >> updating both the bd_block_size and the inode->i_blkbits at the same
> >> time.  It would involve (afaiui) adding fields to both the inode and the
> >> block_device data structures and using rcu_assign_pointer  and
> >> rcu_dereference to modify and access the fields, and both fields would
> >> need to be protected by the same struct srcu_struct.  I'm not sure whether
> >> that's a desirable approach.  When I started to implement it, it got
> >> ugly pretty quickly.  What do others think?
> >
> > Using RCU doesn't seem sensible to me (except for lock implementation, as 
> > it is in patch 4). The major problem is that the block layer reads 
> > blocksize multiple times and when different values are read, a crash may 
> > happen - RCU doesn't protect you against that - if you read a variable 
> > multiple times in a RCU-protected section, you can still get different 
> > results.
> 
> SRCU is sleepable, so could be (I think) used in the same manner as your
> rw semaphore.  The only difference is that it would require changing the
> bd_blocksize and the i_blkbits to pointers and protecting them both with
> the same srcu struct.  Then, the inode i_blkbits would also need to be
> special cased, so that we only require such handling when it is
> associated with a block device.  It got messy.

No, it couldn't be used this way.

If you do
idx = srcu_read_lock(&srcu);
ptr1 = srcu_dereference(pointer, &srcu);
ptr2 = srcu_dereference(pointer, &srcu);
srcu_read_unlock(&srcu, idx);

it doesn't guarantee that ptr1 == ptr2.

All that it guarantees is that when synchronize_srcu exits, there are no 
references to the old structure. But after rcu_assign_pointer and before 
synchronize_srcu exits, readers can read both the old and the new value 
of the pointer, and it is not specified which value they read.

> > If we wanted to use RCU, we would have to read blocksize just once and 
> > pass the value between all functions involved - that would result in a 
> > massive code change.
> 
> If we did that, we wouldn't need rcu at all, would we?

Yes, we wouldn't need RCU then.

Mikulas

> Cheers,
> Jeff
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 1/2] Fix a crash when block device is read and block size is changed at the same time
  2012-09-25 22:49                                           ` [PATCH 1/2] " Mikulas Patocka
  2012-09-26  5:48                                             ` Jens Axboe
@ 2012-11-16 22:02                                             ` Jeff Moyer
  1 sibling, 0 replies; 45+ messages in thread
From: Jeff Moyer @ 2012-11-16 22:02 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Jens Axboe, Eric Dumazet, Andrea Arcangeli, Jan Kara, dm-devel,
	linux-kernel, Alexander Viro, kosaki.motohiro, linux-fsdevel,
	lwoodman, Alasdair G. Kergon

Mikulas Patocka <mpatocka@redhat.com> writes:

> blockdev: fix a crash when block size is changed and I/O is issued simultaneously
>
> The kernel may crash when block size is changed and I/O is issued
> simultaneously.

Mikulas, do you plan to submit a variant of this for stable as well?

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2012-11-16 22:02 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
2012-06-28  3:04 Crash when IO is being submitted and block size is changed Mikulas Patocka
2012-06-28 11:15 ` Jan Kara
2012-06-28 15:44   ` Mikulas Patocka
2012-06-28 16:53     ` Jan Kara
2012-07-16  0:55   ` Mikulas Patocka
2012-07-17 19:19     ` Jeff Moyer
2012-07-19  2:27       ` Mikulas Patocka
2012-07-19 13:33         ` Jeff Moyer
2012-07-28 16:40           ` [PATCH 1/3] Fix " Mikulas Patocka
2012-07-28 16:41             ` [PATCH 2/3] Introduce percpu rw semaphores Mikulas Patocka
2012-07-28 16:42               ` [PATCH 3/3] blockdev: turn a rw semaphore into a percpu rw semaphore Mikulas Patocka
2012-07-28 20:44               ` [PATCH 2/3] Introduce percpu rw semaphores Eric Dumazet
2012-07-29  5:13                 ` [dm-devel] " Mikulas Patocka
2012-07-29 10:10                   ` Eric Dumazet
2012-07-29 18:36                     ` Eric Dumazet
2012-08-01 20:07                       ` Mikulas Patocka
2012-08-01 20:09                       ` [PATCH 4/3] " Mikulas Patocka
2012-08-31 18:40                         ` [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time Mikulas Patocka
2012-08-31 18:41                           ` [PATCH 1/4] Add a lock that will be needed by the next patch Mikulas Patocka
2012-08-31 18:42                             ` [PATCH 2/4] blockdev: fix a crash when block size is changed and I/O is issued simultaneously Mikulas Patocka
2012-08-31 18:43                               ` [PATCH 3/4] blockdev: turn a rw semaphore into a percpu rw semaphore Mikulas Patocka
2012-08-31 18:43                                 ` [PATCH 4/4] New percpu lock implementation Mikulas Patocka
2012-08-31 19:27                           ` [PATCH 0/4] Fix a crash when block device is read and block size is changed at the same time Mikulas Patocka
2012-08-31 20:11                             ` Jeff Moyer
2012-08-31 20:34                               ` Mikulas Patocka
2012-09-17 21:19                               ` Jeff Moyer
2012-09-18 17:04                                 ` Mikulas Patocka
2012-09-18 17:22                                   ` Jeff Moyer
2012-09-18 18:55                                     ` Mikulas Patocka
2012-09-18 18:58                                       ` Jeff Moyer
2012-09-18 20:11                                   ` Jeff Moyer
2012-09-25 17:49                                     ` Jeff Moyer
2012-09-25 17:59                                       ` Jens Axboe
2012-09-25 18:11                                         ` Jens Axboe
2012-09-25 22:49                                           ` [PATCH 1/2] " Mikulas Patocka
2012-09-26  5:48                                             ` Jens Axboe
2012-11-16 22:02                                             ` Jeff Moyer
2012-09-25 22:50                                           ` [PATCH 2/2] " Mikulas Patocka
2012-09-25 22:58                                       ` [PATCH 0/4] " Mikulas Patocka
2012-09-26 13:47                                         ` Jeff Moyer
2012-09-26 14:35                                           ` Mikulas Patocka
2012-07-30 17:00                   ` [dm-devel] [PATCH 2/3] Introduce percpu rw semaphores Paul E. McKenney
2012-07-31  0:00                     ` Mikulas Patocka
2012-08-01 17:15                       ` Paul E. McKenney
2012-06-29  6:25 ` Crash when IO is being submitted and block size is changed Vyacheslav Dubeyko
