linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] block: Invalidate the cache for a parent block-device
@ 2012-01-20  1:58 Mikulas Patocka
  2012-01-20  9:35 ` Niels de Vos
  0 siblings, 1 reply; 23+ messages in thread
From: Mikulas Patocka @ 2012-01-20  1:58 UTC (permalink / raw)
  To: ndevos; +Cc: linux-kernel, bmr

Hi

This definitely doesn't belong to blkdev_issue_flush.

blkdev_issue_flush is called by filesystems each time they want to 
synchronize hardware disk cache (for example, when committing a journal).

The patch may cause serious performance regressions (each time a 
filesystem commits journal, the patch causes it to walk all buffers in the 
whole-disk device).

You should put this code into fsync_bdev (so that it is called only on 
fsync or BLKFLSBUF) and not to blkdev_issue_flush.

Mikulas

> Executing a BLKFLSBUF-ioctl on a partition flushes the caches for that
> partition but reading data through the parent device will still return
> the old cached data.
> 
> The cache for the block-device is not synced if the block-device is kept
> open (due to a mounted partition, for example). Only when all users for
> the disk have exited, the cache for the disk is made consistent again.
> 
> Calling invalidate_bdev() on the parent block-device in case
> blkdev_issue_flush() was called for a partition fixes this.
> 
> The problem can be worked around by forcing the caches to be flushed
> with either
>         # blockdev --flushbufs ${dev_disk}
> or
>         # echo 3 > /proc/sys/vm/drop_caches
> 
> CC: Bryn M. Reeves <bmr <at> redhat.com>
> Signed-off-by: Niels de Vos <ndevos <at> redhat.com>
> ---
>  block/blk-flush.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index 720ad60..e876f8e 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -448,6 +448,9 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_
> 
>         if (!bio_flagged(bio, BIO_UPTODATE))
>                 ret = -EIO;
> +       else if (bdev != bdev->bd_contains)
> +               /* invalidate parent block_device */
> +               invalidate_bdev(bdev->bd_contains);
> 
>         bio_put(bio);
>         return ret;
> --
> 1.7.6.5
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] block: Invalidate the cache for a parent block-device
  2012-01-20  1:58 [PATCH] block: Invalidate the cache for a parent block-device Mikulas Patocka
@ 2012-01-20  9:35 ` Niels de Vos
  2012-01-23 10:38   ` [PATCH v2] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition Niels de Vos
  0 siblings, 1 reply; 23+ messages in thread
From: Niels de Vos @ 2012-01-20  9:35 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: linux-kernel, Bryn M. Reeves

On 01/20/2012 01:58 AM, Mikulas Patocka wrote:
> This definitely doesn't belong to blkdev_issue_flush.
> 
> blkdev_issue_flush is called by filesystems each time they want to 
> synchronize hardware disk cache (for example, when committing a journal).

Ay, that sounds bad.

> The patch may cause serious performance regressions (each time a 
> filesystem commits journal, the patch causes it to walk all buffers in the 
> whole-disk device).
> 
> You should put this code into fsync_bdev (so that it is called only on 
> fsync or BLKFLSBUF) and not to blkdev_issue_flush.

Hmm, I actually thought about that earlier, but decided that
blkdev_issue_flush() was nicer. Looking at it again, I'm not sure why I
thought that.

I'll post a 2nd version soon.

Thanks,
Niels


> Mikulas
> 
>> Executing a BLKFLSBUF-ioctl on a partition flushes the caches for that
>> partition but reading data through the parent device will still return
>> the old cached data.
>>
>> The cache for the block-device is not synced if the block-device is kept
>> open (due to a mounted partition, for example). Only when all users for
>> the disk have exited, the cache for the disk is made consistent again.
>>
>> Calling invalidate_bdev() on the parent block-device in case
>> blkdev_issue_flush() was called for a partition fixes this.
>>
>> The problem can be worked around by forcing the caches to be flushed
>> with either
>>         # blockdev --flushbufs ${dev_disk}
>> or
>>         # echo 3 > /proc/sys/vm/drop_caches
>>
>> CC: Bryn M. Reeves <bmr <at> redhat.com>
>> Signed-off-by: Niels de Vos <ndevos <at> redhat.com>
>> ---
>>  block/blk-flush.c |    3 +++
>>  1 files changed, 3 insertions(+), 0 deletions(-)
>>
>> diff --git a/block/blk-flush.c b/block/blk-flush.c
>> index 720ad60..e876f8e 100644
>> --- a/block/blk-flush.c
>> +++ b/block/blk-flush.c
>> @@ -448,6 +448,9 @@ int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_
>>
>>         if (!bio_flagged(bio, BIO_UPTODATE))
>>                 ret = -EIO;
>> +       else if (bdev != bdev->bd_contains)
>> +               /* invalidate parent block_device */
>> +               invalidate_bdev(bdev->bd_contains);
>>
>>         bio_put(bio);
>>         return ret;
>> --
>> 1.7.6.5
>>

-- 
Niels de Vos
Software Maintenance Engineer
Support Engineering Group
Red Hat UK, Ltd.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-20  9:35 ` Niels de Vos
@ 2012-01-23 10:38   ` Niels de Vos
  2012-01-23 16:27     ` Jeff Moyer
                       ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Niels de Vos @ 2012-01-23 10:38 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-kernel, Niels de Vos, Bryn M. Reeves, Mikulas Patocka

Executing an fsync() on a file-descriptor of a partition flushes the
caches for that partition by calling blkdev_issue_flush(). However, it
seems that reading data through the parent device will still return the
old cached data.

The cache for the block-device is not synced if the block-device is kept
open (due to a mounted partition, for example). Only when all users for
the disk have exited, the cache for the disk is made consistent again.

Calling invalidate_bdev() on the parent block-device in case
blkdev_fsync() was called for a partition, fixes this.

The problem can be worked around by forcing the caches to be flushed
with either
	# blockdev --flushbufs ${dev_disk}
or
	# echo 3 > /proc/sys/vm/drop_caches

CC: Bryn M. Reeves <bmr@redhat.com>
CC: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Niels de Vos <ndevos@redhat.com>

---
v2:
- Do not call invalidate_bdev() from blkdev_issue_flush() and prevent
  performance degration with journalled filesystems.

  Suggested was to call invalidate_bdev() in fsync_bdev(), but this is
  not in the call-path of mkfs.ext3 and similar tools. Hence the issue
  persists.

- Correct phrasing a little, changing ioctl-BLKFLSBUF is not required.

- This issue also occurs when doing an ioctl-BLKFLSBUF on a partition.
  Reading the whole disk will still return cached data. If this is an
  issue, it will need a seperate patch.
---
 fs/block_dev.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0e575d1..433c4de 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -424,6 +424,10 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
 	if (error == -EOPNOTSUPP)
 		error = 0;
 
+	/* invalidate parent block_device */
+	if (!error && bdev != bdev->bd_contains)
+		invalidate_bdev(bdev->bd_contains);
+
 	return error;
 }
 EXPORT_SYMBOL(blkdev_fsync);
-- 
1.7.6.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-23 10:38   ` [PATCH v2] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition Niels de Vos
@ 2012-01-23 16:27     ` Jeff Moyer
  2012-01-23 16:46       ` Niels de Vos
  2012-01-23 19:23     ` Mikulas Patocka
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Jeff Moyer @ 2012-01-23 16:27 UTC (permalink / raw)
  To: Niels de Vos; +Cc: linux-fsdevel, linux-kernel, Bryn M. Reeves, Mikulas Patocka

Niels de Vos <ndevos@redhat.com> writes:

> Executing an fsync() on a file-descriptor of a partition flushes the
> caches for that partition by calling blkdev_issue_flush(). However, it
> seems that reading data through the parent device will still return the
> old cached data.

What problem, exactly, are you trying to fix?  Could you please post a
reproducer?

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-23 16:27     ` Jeff Moyer
@ 2012-01-23 16:46       ` Niels de Vos
  0 siblings, 0 replies; 23+ messages in thread
From: Niels de Vos @ 2012-01-23 16:46 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-fsdevel, linux-kernel, Bryn M. Reeves, Mikulas Patocka

On 01/23/2012 04:27 PM, Jeff Moyer wrote:
> Niels de Vos <ndevos@redhat.com> writes:
> 
>> Executing an fsync() on a file-descriptor of a partition flushes the
>> caches for that partition by calling blkdev_issue_flush(). However, it
>> seems that reading data through the parent device will still return the
>> old cached data.
> 
> What problem, exactly, are you trying to fix?  Could you please post a
> reproducer?

The problem that was noticed is the following:
1) create two or more partitions on a device
   - use fdisk to create /dev/sdb1 and /dev/sdb2
2) format and mount one of the partition
   - mkfs -t ext3 /dev/sdb1
3) read through the main device to have something in the cache
   - read /dev/sdb with dd or use something like "parted /dev/sdb print"
4) now write something to /dev/sdb2, format the partition for example
   - mkfs -t ext3 /dev/sdb2
5) read the blocks where sdb2 starts, through /dev/sdb
   - use dd or do again a "parted /dev/sdb print"

Without this patch, calling "blockdev --flushbufs" or dropping the
caches, the result in 5) is the same as in 3). Reading the same area
through /dev/sdb2 shows the inconsistancy between the two caches.

With this patch, or one of the workarounds, the data read through
/dev/sdb and /dev/sdb2 is the same.

I hope this explains is clear enough, if not, please let me know.

Thanks,
Niels

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-23 10:38   ` [PATCH v2] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition Niels de Vos
  2012-01-23 16:27     ` Jeff Moyer
@ 2012-01-23 19:23     ` Mikulas Patocka
  2012-01-23 20:04     ` Jeff Moyer
  2012-01-26 10:03     ` Andrew Morton
  3 siblings, 0 replies; 23+ messages in thread
From: Mikulas Patocka @ 2012-01-23 19:23 UTC (permalink / raw)
  To: Niels de Vos; +Cc: linux-fsdevel, linux-kernel, Bryn M. Reeves

On Mon, 23 Jan 2012, Niels de Vos wrote:

> Executing an fsync() on a file-descriptor of a partition flushes the
> caches for that partition by calling blkdev_issue_flush(). However, it
> seems that reading data through the parent device will still return the
> old cached data.
> 
> The cache for the block-device is not synced if the block-device is kept
> open (due to a mounted partition, for example). Only when all users for
> the disk have exited, the cache for the disk is made consistent again.
> 
> Calling invalidate_bdev() on the parent block-device in case
> blkdev_fsync() was called for a partition, fixes this.
> 
> The problem can be worked around by forcing the caches to be flushed
> with either
> 	# blockdev --flushbufs ${dev_disk}
> or
> 	# echo 3 > /proc/sys/vm/drop_caches
> 
> CC: Bryn M. Reeves <bmr@redhat.com>
> CC: Mikulas Patocka <mpatocka@redhat.com>
> Signed-off-by: Niels de Vos <ndevos@redhat.com>
> 

Acked-by: Mikulas Patocka <mpatocka@redhat.com>

> ---
> v2:
> - Do not call invalidate_bdev() from blkdev_issue_flush() and prevent
>   performance degration with journalled filesystems.
> 
>   Suggested was to call invalidate_bdev() in fsync_bdev(), but this is
>   not in the call-path of mkfs.ext3 and similar tools. Hence the issue
>   persists.
> 
> - Correct phrasing a little, changing ioctl-BLKFLSBUF is not required.
> 
> - This issue also occurs when doing an ioctl-BLKFLSBUF on a partition.
>   Reading the whole disk will still return cached data. If this is an
>   issue, it will need a seperate patch.
> ---
>  fs/block_dev.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 0e575d1..433c4de 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -424,6 +424,10 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
>  	if (error == -EOPNOTSUPP)
>  		error = 0;
>  
> +	/* invalidate parent block_device */
> +	if (!error && bdev != bdev->bd_contains)
> +		invalidate_bdev(bdev->bd_contains);
> +
>  	return error;
>  }
>  EXPORT_SYMBOL(blkdev_fsync);
> -- 
> 1.7.6.5
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-23 10:38   ` [PATCH v2] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition Niels de Vos
  2012-01-23 16:27     ` Jeff Moyer
  2012-01-23 19:23     ` Mikulas Patocka
@ 2012-01-23 20:04     ` Jeff Moyer
  2012-01-26 10:03     ` Andrew Morton
  3 siblings, 0 replies; 23+ messages in thread
From: Jeff Moyer @ 2012-01-23 20:04 UTC (permalink / raw)
  To: Niels de Vos; +Cc: linux-fsdevel, linux-kernel, Bryn M. Reeves, Mikulas Patocka

Niels de Vos <ndevos@redhat.com> writes:

> Executing an fsync() on a file-descriptor of a partition flushes the
> caches for that partition by calling blkdev_issue_flush(). However, it
> seems that reading data through the parent device will still return the
> old cached data.

You are mixing up two different caches.  blkdev_issue_flush flushes the
hard disk's cache, if it exists, and is advertised as a write-back
cache.  The cache that doesn't get flushed is the page cache, since you
are accessing the same disk blocks via different devices (the whole
device and the partition).

Nevertheless, the fix looks fine to me.  There shouldn't be much cached
for the whole block device if there are partitions, so this shouldn't
generate an I/O storm or flush out buffers that are otherwise useful.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-23 10:38   ` [PATCH v2] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition Niels de Vos
                       ` (2 preceding siblings ...)
  2012-01-23 20:04     ` Jeff Moyer
@ 2012-01-26 10:03     ` Andrew Morton
  2012-01-26 11:50       ` Niels de Vos
  3 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2012-01-26 10:03 UTC (permalink / raw)
  To: Niels de Vos
  Cc: linux-fsdevel, linux-kernel, Bryn M. Reeves, Mikulas Patocka, Al Viro

I suggest a viro cc on this one.

On Mon, 23 Jan 2012 10:38:29 +0000 Niels de Vos <ndevos@redhat.com> wrote:

> Executing an fsync() on a file-descriptor of a partition flushes the
> caches for that partition by calling blkdev_issue_flush(). However, it

The changelog is stale.

> seems that reading data through the parent device will still return the
> old cached data.
> 
> The cache for the block-device is not synced if the block-device is kept
> open (due to a mounted partition, for example). Only when all users for
> the disk have exited, the cache for the disk is made consistent again.
> 
> Calling invalidate_bdev() on the parent block-device in case
> blkdev_fsync() was called for a partition, fixes this.
> 
> The problem can be worked around by forcing the caches to be flushed
> with either
> 	# blockdev --flushbufs ${dev_disk}
> or
> 	# echo 3 > /proc/sys/vm/drop_caches

Please include your (useful) problem description in the changelog:

: The problem that was noticed is the following:
: 1) create two or more partitions on a device
:    - use fdisk to create /dev/sdb1 and /dev/sdb2
: 2) format and mount one of the partition
:    - mkfs -t ext3 /dev/sdb1
: 3) read through the main device to have something in the cache
:    - read /dev/sdb with dd or use something like "parted /dev/sdb print"
: 4) now write something to /dev/sdb2, format the partition for example
:    - mkfs -t ext3 /dev/sdb2
: 5) read the blocks where sdb2 starts, through /dev/sdb
:    - use dd or do again a "parted /dev/sdb print"
: 
: Without this patch, calling "blockdev --flushbufs" or dropping the
: caches, the result in 5) is the same as in 3). Reading the same area
: through /dev/sdb2 shows the inconsistancy between the two caches.
: 
: With this patch, or one of the workarounds, the data read through
: /dev/sdb and /dev/sdb2 is the same.

>
> ...
>
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -424,6 +424,10 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
>  	if (error == -EOPNOTSUPP)
>  		error = 0;
>  
> +	/* invalidate parent block_device */
> +	if (!error && bdev != bdev->bd_contains)
> +		invalidate_bdev(bdev->bd_contains);
> +
>  	return error;
>  }

It doesn't seem terribly logical to do this in blkdev_fsync().  Why not
do it right there in blkdev_ioctl()'s "case BLKFLSBUF"?

Bear in mind that invalidate_bdev() isn't a very strong function -
it won't drop pages which are dirty or under writeback nor pages which
others have a reference on.  But I can see that this change would be a
best-effort user-convenience thing.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-26 10:03     ` Andrew Morton
@ 2012-01-26 11:50       ` Niels de Vos
  2012-01-26 13:33         ` [PATCH v3] " Niels de Vos
  0 siblings, 1 reply; 23+ messages in thread
From: Niels de Vos @ 2012-01-26 11:50 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-fsdevel, linux-kernel, Bryn M. Reeves, Mikulas Patocka, Al Viro

On 01/26/2012 10:03 AM, Andrew Morton wrote:
> I suggest a viro cc on this one.
> 
> On Mon, 23 Jan 2012 10:38:29 +0000 Niels de Vos <ndevos@redhat.com> wrote:
> 
>> Executing an fsync() on a file-descriptor of a partition flushes the
>> caches for that partition by calling blkdev_issue_flush(). However, it
> 
> The changelog is stale.

Hmm, looks like it. I'll update it and send out a corrected patch soon.

>> seems that reading data through the parent device will still return the
>> old cached data.
>>
>> The cache for the block-device is not synced if the block-device is kept
>> open (due to a mounted partition, for example). Only when all users for
>> the disk have exited, the cache for the disk is made consistent again.
>>
>> Calling invalidate_bdev() on the parent block-device in case
>> blkdev_fsync() was called for a partition, fixes this.
>>
>> The problem can be worked around by forcing the caches to be flushed
>> with either
>> 	# blockdev --flushbufs ${dev_disk}
>> or
>> 	# echo 3 > /proc/sys/vm/drop_caches
> 
> Please include your (useful) problem description in the changelog:
> 
> : The problem that was noticed is the following:
> : 1) create two or more partitions on a device
> :    - use fdisk to create /dev/sdb1 and /dev/sdb2
> : 2) format and mount one of the partition
> :    - mkfs -t ext3 /dev/sdb1
> : 3) read through the main device to have something in the cache
> :    - read /dev/sdb with dd or use something like "parted /dev/sdb print"
> : 4) now write something to /dev/sdb2, format the partition for example
> :    - mkfs -t ext3 /dev/sdb2
> : 5) read the blocks where sdb2 starts, through /dev/sdb
> :    - use dd or do again a "parted /dev/sdb print"
> : 
> : Without this patch, calling "blockdev --flushbufs" or dropping the
> : caches, the result in 5) is the same as in 3). Reading the same area
> : through /dev/sdb2 shows the inconsistancy between the two caches.
> : 
> : With this patch, or one of the workarounds, the data read through
> : /dev/sdb and /dev/sdb2 is the same.

I'll include this too.

>>
>> ...
>>
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -424,6 +424,10 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
>>  	if (error == -EOPNOTSUPP)
>>  		error = 0;
>>  
>> +	/* invalidate parent block_device */
>> +	if (!error && bdev != bdev->bd_contains)
>> +		invalidate_bdev(bdev->bd_contains);
>> +
>>  	return error;
>>  }
> 
> It doesn't seem terribly logical to do this in blkdev_fsync().  Why not
> do it right there in blkdev_ioctl()'s "case BLKFLSBUF"?

Most userspace tools do not call the blkdev_ioctl(), but only do a
fsync(). Therefore it does not help existing tools if we update that
calling path. I agree it seems more logical to make the change there,
but it is out of scope for the use-cases I have seen until now.

> Bear in mind that invalidate_bdev() isn't a very strong function -
> it won't drop pages which are dirty or under writeback nor pages which
> others have a reference on.  But I can see that this change would be a
> best-effort user-convenience thing.

It is only intended to refresh the read caches. The writeback path
should not be affected.

Thanks for the feedback,
Niels

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v3] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-26 11:50       ` Niels de Vos
@ 2012-01-26 13:33         ` Niels de Vos
  2012-01-26 21:40           ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Niels de Vos @ 2012-01-26 13:33 UTC (permalink / raw)
  To: linux-fsdevel, Andrew Morton
  Cc: linux-kernel, Al Viro, Mikulas Patocka, Jeff Moyer, Niels de Vos,
	Bryn M. Reeves

Executing an fsync() on a file-descriptor of a partition flushes the
caches for that partition by calling blkdev_fsync(). However, it seems
that reading data through the parent device will still return the old
cached data.

The problem can be worked around by forcing the caches to be flushed
with either
	# blockdev --flushbufs ${dev_disk}
or
	# echo 3 > /proc/sys/vm/drop_caches

One of the use-cases that shows this problem:
1) create two or more partitions on a device
   - use fdisk to create /dev/sdb1 and /dev/sdb2
2) format and mount one of the partition
   - mkfs -t ext3 /dev/sdb1
3) read through the main device to have something in the cache
   - read /dev/sdb with dd or use something like "parted /dev/sdb print"
4) now write something to /dev/sdb2, format the partition for example
   - mkfs -t ext3 /dev/sdb2
5) read the blocks where sdb2 starts, through /dev/sdb
   - use dd or do again a "parted /dev/sdb print"

The cache for the block-device is not synced if the block-device is kept
open (due to a mounted partition, for example). Only when all users for
the disk have exited, the cache for the disk is made consistent again.

Without this patch, calling "blockdev --flushbufs" or dropping the
caches, the result in 5) is the same as in 3). Reading the same area
through /dev/sdb2 shows the inconsistancy between the two caches.

CC: Bryn M. Reeves <bmr@redhat.com>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Niels de Vos <ndevos@redhat.com>

---
v3:
- Correct commit message, no need to mention blkdev_issue_flush().

- Include a use-case for giving a clearer understaning of the problem
  that is being addressed.

- Include received Acked-by and Reviewed-by, only the commit message
  changed in this v3.

v2:
- Do not call invalidate_bdev() from blkdev_issue_flush() and prevent
  performance degration with journalled filesystems.

  Suggested was to call invalidate_bdev() in fsync_bdev(), but this is
  not in the call-path of mkfs.ext3 and similar tools. Hence the issue
  persists.

- Correct phrasing a little, changing ioctl-BLKFLSBUF is not required.

- This issue also occurs when doing an ioctl-BLKFLSBUF on a partition.
  Reading the whole disk will still return cached data. If this is an
  issue, it will need a seperate patch.
---
 fs/block_dev.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0e575d1..433c4de 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -424,6 +424,10 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
 	if (error == -EOPNOTSUPP)
 		error = 0;
 
+	/* invalidate parent block_device */
+	if (!error && bdev != bdev->bd_contains)
+		invalidate_bdev(bdev->bd_contains);
+
 	return error;
 }
 EXPORT_SYMBOL(blkdev_fsync);
-- 
1.7.6.5


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v3] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-26 13:33         ` [PATCH v3] " Niels de Vos
@ 2012-01-26 21:40           ` Andrew Morton
  2012-01-26 21:45             ` Christoph Hellwig
                               ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Andrew Morton @ 2012-01-26 21:40 UTC (permalink / raw)
  To: Niels de Vos
  Cc: linux-fsdevel, linux-kernel, Al Viro, Mikulas Patocka,
	Jeff Moyer, Bryn M. Reeves

On Thu, 26 Jan 2012 13:33:22 +0000
Niels de Vos <ndevos@redhat.com> wrote:

> Executing an fsync() on a file-descriptor of a partition flushes the
> caches for that partition by calling blkdev_fsync(). However, it seems
> that reading data through the parent device will still return the old
> cached data.
> 
> The problem can be worked around by forcing the caches to be flushed
> with either
> 	# blockdev --flushbufs ${dev_disk}
> or
> 	# echo 3 > /proc/sys/vm/drop_caches
> 
> One of the use-cases that shows this problem:
> 1) create two or more partitions on a device
>    - use fdisk to create /dev/sdb1 and /dev/sdb2
> 2) format and mount one of the partition
>    - mkfs -t ext3 /dev/sdb1
> 3) read through the main device to have something in the cache
>    - read /dev/sdb with dd or use something like "parted /dev/sdb print"
> 4) now write something to /dev/sdb2, format the partition for example
>    - mkfs -t ext3 /dev/sdb2
> 5) read the blocks where sdb2 starts, through /dev/sdb
>    - use dd or do again a "parted /dev/sdb print"
> 
> The cache for the block-device is not synced if the block-device is kept
> open (due to a mounted partition, for example). Only when all users for
> the disk have exited, the cache for the disk is made consistent again.
> 
> Without this patch, calling "blockdev --flushbufs" or dropping the
> caches, the result in 5) is the same as in 3). Reading the same area
> through /dev/sdb2 shows the inconsistancy between the two caches.
> 
> ...
>
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -424,6 +424,10 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
>  	if (error == -EOPNOTSUPP)
>  		error = 0;
>  
> +	/* invalidate parent block_device */
> +	if (!error && bdev != bdev->bd_contains)
> +		invalidate_bdev(bdev->bd_contains);
> +
>  	return error;
>  }
>  EXPORT_SYMBOL(blkdev_fsync);

I can't say I'm a huge fan of this.  It just isn't logical to drop
/dev/sda's pagecache in here.

We're adapting the kernel to the behavior of existing userspace by
inserting a useful side-effect into a suprising place.  The result is
pretty darned hacky.

The Right Thing To Do here is to make the kernel behave logically and
predictably, then modify the userspace tools.  But if we're modifying
the userspace tools then we would just change userspace to issue a
BLKFLSBUF to /dev/sda and leave the kernel alone.

So hm.  I think I might prefer to leave the issue unfixed rather than
doing this to the poor old kernel :(


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-26 21:40           ` Andrew Morton
@ 2012-01-26 21:45             ` Christoph Hellwig
  2012-01-26 21:50               ` Mikulas Patocka
  2012-01-31 16:00               ` Niels de Vos
  2012-01-26 21:49             ` Jeff Moyer
  2012-01-26 22:13             ` Mikulas Patocka
  2 siblings, 2 replies; 23+ messages in thread
From: Christoph Hellwig @ 2012-01-26 21:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Niels de Vos, linux-fsdevel, linux-kernel, Al Viro,
	Mikulas Patocka, Jeff Moyer, Bryn M. Reeves

On Thu, Jan 26, 2012 at 01:40:51PM -0800, Andrew Morton wrote:
> The Right Thing To Do here is to make the kernel behave logically and
> predictably, then modify the userspace tools.  But if we're modifying
> the userspace tools then we would just change userspace to issue a
> BLKFLSBUF to /dev/sda and leave the kernel alone.

The right fix is to make partition and whole disk access coherent,
which is fairly simply:

 - create the block device inode/mapping per gendisk, and only reference
   count it per block_device
 - make sure blkdev_get_block(s) applies the correct offset if used on
   partitions


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-26 21:40           ` Andrew Morton
  2012-01-26 21:45             ` Christoph Hellwig
@ 2012-01-26 21:49             ` Jeff Moyer
  2012-01-26 22:13             ` Mikulas Patocka
  2 siblings, 0 replies; 23+ messages in thread
From: Jeff Moyer @ 2012-01-26 21:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Niels de Vos, linux-fsdevel, linux-kernel, Al Viro,
	Mikulas Patocka, Bryn M. Reeves

Andrew Morton <akpm@linux-foundation.org> writes:

> On Thu, 26 Jan 2012 13:33:22 +0000
> Niels de Vos <ndevos@redhat.com> wrote:
>
>> Executing an fsync() on a file-descriptor of a partition flushes the
>> caches for that partition by calling blkdev_fsync(). However, it seems
>> that reading data through the parent device will still return the old
>> cached data.
>> 
>> The problem can be worked around by forcing the caches to be flushed
>> with either
>> 	# blockdev --flushbufs ${dev_disk}
>> or
>> 	# echo 3 > /proc/sys/vm/drop_caches
>> 
>> One of the use-cases that shows this problem:
>> 1) create two or more partitions on a device
>>    - use fdisk to create /dev/sdb1 and /dev/sdb2
>> 2) format and mount one of the partition
>>    - mkfs -t ext3 /dev/sdb1
>> 3) read through the main device to have something in the cache
>>    - read /dev/sdb with dd or use something like "parted /dev/sdb print"
>> 4) now write something to /dev/sdb2, format the partition for example
>>    - mkfs -t ext3 /dev/sdb2
>> 5) read the blocks where sdb2 starts, through /dev/sdb
>>    - use dd or do again a "parted /dev/sdb print"
>> 
>> The cache for the block-device is not synced if the block-device is kept
>> open (due to a mounted partition, for example). Only when all users for
>> the disk have exited, the cache for the disk is made consistent again.
>> 
>> Without this patch, calling "blockdev --flushbufs" or dropping the
>> caches, the result in 5) is the same as in 3). Reading the same area
>> through /dev/sdb2 shows the inconsistancy between the two caches.
>> 
>> ...
>>
>> --- a/fs/block_dev.c
>> +++ b/fs/block_dev.c
>> @@ -424,6 +424,10 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
>>  	if (error == -EOPNOTSUPP)
>>  		error = 0;
>>  
>> +	/* invalidate parent block_device */
>> +	if (!error && bdev != bdev->bd_contains)
>> +		invalidate_bdev(bdev->bd_contains);
>> +
>>  	return error;
>>  }
>>  EXPORT_SYMBOL(blkdev_fsync);
>
> I can't say I'm a huge fan of this.  It just isn't logical to drop
> /dev/sda's pagecache in here.
>
> We're adapting the kernel to the behavior of existing userspace by
> inserting a useful side-effect into a suprising place.  The result is
> pretty darned hacky.
>
> The Right Thing To Do here is to make the kernel behave logically and
> predictably, then modify the userspace tools.

To me, logically the caches should not be separate.  /dev/sda and
/dev/sda1 are the same underlying device, after all, and there shouldn't
be any cache aliases.  Unfortunately, that sounds like a rather large
change to me, with a completely new set of side effects to deal with.

> But if we're modifying the userspace tools then we would just change
> userspace to issue a BLKFLSBUF to /dev/sda and leave the kernel alone.
>
> So hm.  I think I might prefer to leave the issue unfixed rather than
> doing this to the poor old kernel :(

I could certainly understand taking this stance.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-26 21:45             ` Christoph Hellwig
@ 2012-01-26 21:50               ` Mikulas Patocka
  2012-01-27 12:19                 ` Ric Wheeler
  2012-01-31 16:00               ` Niels de Vos
  1 sibling, 1 reply; 23+ messages in thread
From: Mikulas Patocka @ 2012-01-26 21:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Niels de Vos, linux-fsdevel, linux-kernel,
	Al Viro, Jeff Moyer, Bryn M. Reeves



On Thu, 26 Jan 2012, Christoph Hellwig wrote:

> On Thu, Jan 26, 2012 at 01:40:51PM -0800, Andrew Morton wrote:
> > The Right Thing To Do here is to make the kernel behave logically and
> > predictably, then modify the userspace tools.  But if we're modifying
> > the userspace tools then we would just change userspace to issue a
> > BLKFLSBUF to /dev/sda and leave the kernel alone.
> 
> The right fix is to make partition and whole disk access coherent,
> which is fairly simply:
> 
>  - create the block device inode/mapping per gendisk, and only reference
>    count it per block_device
>  - make sure blkdev_get_block(s) applies the correct offset if used on
>    partitions

... and what if you use 4kB blocksize and the partition start is not 
aligned on 4kB? (quite common case, because partitions are often aligned 
on 63 sectors) Then, you can't translate partition block numbers into disk 
block numbers.

Mikulas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-26 21:40           ` Andrew Morton
  2012-01-26 21:45             ` Christoph Hellwig
  2012-01-26 21:49             ` Jeff Moyer
@ 2012-01-26 22:13             ` Mikulas Patocka
  2012-01-26 22:26               ` Kernel Oops report (Android gingerbread) Fan Zhang
  2 siblings, 1 reply; 23+ messages in thread
From: Mikulas Patocka @ 2012-01-26 22:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Niels de Vos, linux-fsdevel, linux-kernel, Al Viro, Jeff Moyer,
	Bryn M. Reeves



On Thu, 26 Jan 2012, Andrew Morton wrote:

> On Thu, 26 Jan 2012 13:33:22 +0000
> Niels de Vos <ndevos@redhat.com> wrote:
> 
> > Executing an fsync() on a file-descriptor of a partition flushes the
> > caches for that partition by calling blkdev_fsync(). However, it seems
> > that reading data through the parent device will still return the old
> > cached data.
> > 
> > The problem can be worked around by forcing the caches to be flushed
> > with either
> > 	# blockdev --flushbufs ${dev_disk}
> > or
> > 	# echo 3 > /proc/sys/vm/drop_caches
> > 
> > One of the use-cases that shows this problem:
> > 1) create two or more partitions on a device
> >    - use fdisk to create /dev/sdb1 and /dev/sdb2
> > 2) format and mount one of the partition
> >    - mkfs -t ext3 /dev/sdb1
> > 3) read through the main device to have something in the cache
> >    - read /dev/sdb with dd or use something like "parted /dev/sdb print"
> > 4) now write something to /dev/sdb2, format the partition for example
> >    - mkfs -t ext3 /dev/sdb2
> > 5) read the blocks where sdb2 starts, through /dev/sdb
> >    - use dd or do again a "parted /dev/sdb print"
> > 
> > The cache for the block-device is not synced if the block-device is kept
> > open (due to a mounted partition, for example). Only when all users for
> > the disk have exited, the cache for the disk is made consistent again.
> > 
> > Without this patch, calling "blockdev --flushbufs" or dropping the
> > caches, the result in 5) is the same as in 3). Reading the same area
> > through /dev/sdb2 shows the inconsistancy between the two caches.
> > 
> > ...
> >
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -424,6 +424,10 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
> >  	if (error == -EOPNOTSUPP)
> >  		error = 0;
> >  
> > +	/* invalidate parent block_device */
> > +	if (!error && bdev != bdev->bd_contains)
> > +		invalidate_bdev(bdev->bd_contains);
> > +
> >  	return error;
> >  }
> >  EXPORT_SYMBOL(blkdev_fsync);
> 
> I can't say I'm a huge fan of this.  It just isn't logical to drop
> /dev/sda's pagecache in here.
> 
> We're adapting the kernel to the behavior of existing userspace by
> inserting a useful side-effect into a suprising place.  The result is
> pretty darned hacky.
> 
> The Right Thing To Do here is to make the kernel behave logically and
> predictably, then modify the userspace tools.  But if we're modifying
> the userspace tools then we would just change userspace to issue a
> BLKFLSBUF to /dev/sda and leave the kernel alone.

And how should userspace find out if the device is a partition and what is 
the master device? For example, if I tell you that you have a block 
device with major 259 and minor 2, how will you find out:
1) is it a partition or not?
2) what is the whole-disk device?

Do you think that userspace should recursively scan /sys/block to find out 
disk-partition relationships?

> So hm.  I think I might prefer to leave the issue unfixed rather than
> doing this to the poor old kernel :(

The major problem here is that "invalidate_bdev" is unreliable. 
"invalidate_bdev" skips blocks that are beign read or written. udev may 
open and read any block device anytime. Consequently, "invalidate_bdev" 
may skip to invalidate any block device anytime (if it happens to be 
racing with udev).

But this problem is irrelevant w.r.t the patch - the patch makes it 
neither better nor worse. If someone executes "blockdev --flushbufs" or 
ioctl BLKFLSBUF, it may not flush all buffers.

If you want to make the kernel behave predictably, you should fix 
"invalidate_bdev" so that it invalidates the whole device, even if someone 
is readint it.

Mikulas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Kernel Oops report (Android gingerbread)
  2012-01-26 22:13             ` Mikulas Patocka
@ 2012-01-26 22:26               ` Fan Zhang
  0 siblings, 0 replies; 23+ messages in thread
From: Fan Zhang @ 2012-01-26 22:26 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel

Platform:

Android gingerbread


<1>[  626.353872] Unable to handle kernel NULL pointer dereference at virtual ad
dress 00000000
<1>[  626.353907] pgd = c0004000
<1>[  626.353923] [00000000] *pgd=00000000
<0>[  626.353956] Internal error: Oops: 17 [#1] PREEMPT SMP
<0>[  626.353977] last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/stats/ti
me_in_state
<4>[  626.354003] Modules linked in: j4fs(P) Si4709_driver bthid vibrator
<4>[  626.354059] CPU: 0    Tainted: P        W    (2.6.35.7 #500)
<4>[  626.354117] PC is at page_address+0x14/0xf0
<4>[  626.354150] LR is at blk_rq_bio_prep+0x74/0xb0
<4>[  626.354175] pc : [<c040e268>]    lr : [<c04feb60>]    psr: 20000013
<4>[  626.354188] sp : ea3fdd98  ip : ea3fddc0  fp : ea3fddbc
<4>[  626.354211] r10: c0b12680  r9 : 0b300010  r8 : 00000000
<4>[  626.354233] r7 : c0b12f40  r6 : 0000000c  r5 : f15e34e0  r4 : f0356300
<4>[  626.354256] r3 : 00000000  r2 : f0356348  r1 : f0356300  r0 : 00000000
<4>[  626.354283] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment k
ernel
<4>[  626.354309] Control: 10c53c7d  Table: 70d5c04a  DAC: 00000017
<4>[  626.354330]
<4>[  626.354336] PC: 0xc040e1e8:
<4>[  626.354350] e1e8  e5843008 e5836004 e5826000 e584200c e1a01000 e59f0044 eb
102b89 e89dadf0
<4>[  626.354415] e208  e2434008 e1a06004 e5b63008 f5d3f000 e1560005 1affffe1 e1
a00008 eafffff5
<4>[  626.354480] e228  e59f0014 e3001173 ebfd089f c0b116c0 9e370001 c0b66e20 c0
b64e00 c092ddbf
<4>[  626.354545] e248  c0b67e20 00100100 00200200 e1a0c00d e92ddbf0 e24cb004 e9
2d4000 e8bd4000
<4>[  626.354610] e268  e5902000 e59f10c0 e1a04000 e7e13d52 e3a00e32 e0231390 e5
93130c e0613003
<4>[  626.354675] e288  e1530000 1a00001c e59f80a0 e59f70a0 e0080498 e1a08ca8 e0
875288 e2856008
<4>[  626.354739] e2a8  e1a00006 eb102c07 e7973288 e1530005 12433008 e1a01000 1a
000005 ea000009
<4>[  626.354804] e2c8  e5932000 e1520004 05934004 0a000006 e2403008 e1a02003 e5
b20008 f5d0f000
<4>[  626.354871]
<4>[  626.354877] LR: 0xc04feae0:
<4>[  626.354890] eae0  e3540000 1affffee e89dadf0 e1a0c00d e92dd8f0 e24cb004 e9
2d4000 e8bd4000
<4>[  626.354954] eb00  e5923014 e1a04002 e3520000 e5912020 e2033001 e1a05001 e1
823003 e5813020
<4>[  626.355019] eb20  0a000014 e5943038 e3530000 0a000011 e1a01004 ebfd59dd e1
c507b0 e1d431b8
<4>[  626.355084] eb40  e3530000 0a00000a e1d431ba e3a0600c e5942038 e0030396 e7
920003 ebfc3dbc
<4>[  626.355148] eb60  e5943038 e1d421ba e0263296 e5963008 e0803003 e585307c e5
943020 e5854044
<4>[  626.355213] eb80  e5854040 e5853030 e594300c e3530000 1593306c 15853068 e8
9da8f0 e1a0c00d
<4>[  626.355277] eba0  e92dd830 e24cb004 e92d4000 e8bd4000 e1a0c001 e5911030 e3
a02001 e5802024
<4>[  626.355341] ebc0  e1a03000 e5902020 e580102c e59c1014 e3110010 0201100e 13
82200e 01822001
<4>[  626.355408]
<4>[  626.355414] SP: 0xea3fdd18:
<4>[  626.355428] dd18  01080001 eff9f2c4 00000000 f15e34e0 00000001 f16a1188 ff
ffffff ea3fdd84
<4>[  626.355493] dd38  0000000c c0b12f40 ea3fddbc ea3fdd50 c034bcec c034b4e0 00
000000 f0356300
<4>[  626.355557] dd58  f0356348 00000000 f0356300 f15e34e0 0000000c c0b12f40 00
000000 0b300010
<4>[  626.355621] dd78  c0b12680 ea3fddbc ea3fddc0 ea3fdd98 c04feb60 c040e268 20
000013 ffffffff
<4>[  626.355685] dd98  f0356300 f15e34e0 0000000c c0b12f40 00000000 0b300010 ea
3fdddc ea3fddc0
<4>[  626.355750] ddb8  c04feb60 c040e260 f0356e80 00000000 f16a1188 c0b12f40 ea
3fddf4 ea3fdde0
<4>[  626.355814] ddd8  c04fec70 c04feaf8 f16a1188 00000001 ea3fde34 ea3fddf8 c0
501e0c c04feba8
<4>[  626.355878] ddf8  ea3fde54 f0356300 c0382f30 f15e34e0 c0a7b9c8 f0356300 ea
3fc000 f16a1188
<4>[  626.355943]
<4>[  626.355949] IP: 0xea3fdd40:
<4>[  626.355962] dd40  ea3fddbc ea3fdd50 c034bcec c034b4e0 00000000 f0356300 f0
356348 00000000
<4>[  626.356027] dd60  f0356300 f15e34e0 0000000c c0b12f40 00000000 0b300010 c0
b12680 ea3fddbc
<4>[  626.356091] dd80  ea3fddc0 ea3fdd98 c04feb60 c040e268 20000013 ffffffff f0
356300 f15e34e0
<4>[  626.356155] dda0  0000000c c0b12f40 00000000 0b300010 ea3fdddc ea3fddc0 c0
4feb60 c040e260
<4>[  626.356219] ddc0  f0356e80 00000000 f16a1188 c0b12f40 ea3fddf4 ea3fdde0 c0
4fec70 c04feaf8
<4>[  626.356284] dde0  f16a1188 00000001 ea3fde34 ea3fddf8 c0501e0c c04feba8 ea
3fde54 f0356300
<4>[  626.356348] de00  c0382f30 f15e34e0 c0a7b9c8 f0356300 ea3fc000 f16a1188 c0
b12f40 00000000
<4>[  626.356412] de20  0b300010 c0b12680 ea3fdef4 ea3fde38 c04ffca8 c0501a10 ea
3fc000 c0b68730
<4>[  626.356478]
<4>[  626.356483] FP: 0xea3fdd3c:
<4>[  626.356497] dd3c  c0b12f40 ea3fddbc ea3fdd50 c034bcec c034b4e0 00000000 f0
356300 f0356348
<4>[  626.356561] dd5c  00000000 f0356300 f15e34e0 0000000c c0b12f40 00000000 0b
300010 c0b12680
<4>[  626.356625] dd7c  ea3fddbc ea3fddc0 ea3fdd98 c04feb60 c040e268 20000013 ff
ffffff f0356300
<4>[  626.356690] dd9c  f15e34e0 0000000c c0b12f40 00000000 0b300010 ea3fdddc ea
3fddc0 c04feb60
<4>[  626.356754] ddbc  c040e260 f0356e80 00000000 f16a1188 c0b12f40 ea3fddf4 ea
3fdde0 c04fec70
<4>[  626.356818] dddc  c04feaf8 f16a1188 00000001 ea3fde34 ea3fddf8 c0501e0c c0
4feba8 ea3fde54
<4>[  626.356882] ddfc  f0356300 c0382f30 f15e34e0 c0a7b9c8 f0356300 ea3fc000 f1
6a1188 c0b12f40
<4>[  626.356947] de1c  00000000 0b300010 c0b12680 ea3fdef4 ea3fde38 c04ffca8 c0
501a10 ea3fc000
<4>[  626.357013]
<4>[  626.357019] R1: 0xf0356280:
<4>[  626.357032] 6280  f0356980 00000000 00000000 f18f02a0 30000009 00000000 00
000020 00000020
<4>[  626.357096] 62a0  00000000 00001000 00001000 00000040 ffffffff 00000000 f1
593300 c0459990
<4>[  626.357161] 62c0  00000000 c04556ac c1161760 00001000 00000000 c08286a0 f0
356280 00000000
<4>[  626.357225] 62e0  f1c01040 f1c03280 f1802098 0000001b 00001000 00000000 ff
ffffff ffffffff
<4>[  626.357289] 6300  f0356e80 00000000 00000000 f18222a0 f0000009 00000001 00
000001 00000001
<4>[  626.357353] 6320  00000000 00001000 00001000 00000004 ffffffff 00000000 f0
356348 c0675ac4
<4>[  626.357417] 6340  ead41f00 c0676978 00000000 00001000 00000000 c12551e0 00
001000 00000000
<4>[  626.357481] 6360  c1255240 00001000 00000000 c1255220 00001000 00000000 ff
ffffff ffffffff
<4>[  626.357547]
<4>[  626.357553] R2: 0xf03562c8:
<4>[  626.357566] 62c8  c1161760 00001000 00000000 c08286a0 f0356280 00000000 f1
c01040 f1c03280
<4>[  626.357631] 62e8  f1802098 0000001b 00001000 00000000 ffffffff ffffffff f0
356e80 00000000
<4>[  626.357695] 6308  00000000 f18222a0 f0000009 00000001 00000001 00000001 00
000000 00001000
<4>[  626.357758] 6328  00001000 00000004 ffffffff 00000000 f0356348 c0675ac4 ea
d41f00 c0676978
<4>[  626.357822] 6348  00000000 00001000 00000000 c12551e0 00001000 00000000 c1
255240 00001000
<4>[  626.357886] 6368  00000000 c1255220 00001000 00000000 ffffffff ffffffff f1
e95481 f0596500
<4>[  626.357950] 6388  f1e95780 00000000 bf001908 f0596510 00000100 f03563d8 7f
ffffff 00000000
<4>[  626.358014] 63a8  00000013 00000000 00000000 00000000 0808981e 0000003c 00
000000 00000000
<4>[  626.358078]
<4>[  626.358084] R4: 0xf0356280:
<4>[  626.358097] 6280  f0356980 00000000 00000000 f18f02a0 30000009 00000000 00
000020 00000020
<4>[  626.358161] 62a0  00000000 00001000 00001000 00000040 ffffffff 00000000 f1
593300 c0459990
<4>[  626.358225] 62c0  00000000 c04556ac c1161760 00001000 00000000 c08286a0 f0
356280 00000000
<4>[  626.358288] 62e0  f1c01040 f1c03280 f1802098 0000001b 00001000 00000000 ff
ffffff ffffffff
<4>[  626.358352] 6300  f0356e80 00000000 00000000 f18222a0 f0000009 00000001 00
000001 00000001
<4>[  626.358416] 6320  00000000 00001000 00001000 00000004 ffffffff 00000000 f0
356348 c0675ac4
<4>[  626.358480] 6340  ead41f00 c0676978 00000000 00001000 00000000 c12551e0 00
001000 00000000
<4>[  626.358544] 6360  c1255240 00001000 00000000 c1255220 00001000 00000000 ff
ffffff ffffffff
<4>[  626.358609]
<4>[  626.358615] R5: 0xf15e3460:
<4>[  626.358628] 3460  f15e3461 f15e3870 00000000 00000000 00000000 00000000 f1
ebe800 0000feb4
<4>[  626.358692] 3480  00000002 00000000 00000000 de9df000 ffffffff 00000000 00
000000 00000000
<4>[  626.358756] 34a0  00000000 00000000 f15e3498 00000010 00000000 00000000 00
002000 00000000
<4>[  626.358819] 34c0  00000000 f15e34c4 f15e34c4 00000000 00000000 00000000 00
000000 00000000
<4>[  626.358883] 34e0  f15e34e0 f15e34e0 00000000 00000000 00000000 00000000 00
000000 f16a1188
<4>[  626.358946] 3500  01082001 00000001 00000000 ffffffff 00000000 00000000 f0
356e80 00000000
<4>[  626.359010] 3520  00000000 00000000 00000000 00000000 f15e3530 00000000 00
000000 eff9f2c0
<4>[  626.359073] 3540  eff8c8e8 f1705428 00000000 0000fef1 00000001 00000001 00
000000 00000000
<4>[  626.359138]
<4>[  626.359144] R7: 0xc0b12ec0:
<4>[  626.359157] 2ec0  c09957b4 00000000 00000000 00000000 00000000 00000000 00
000000 00000000
<4>[  626.359220] 2ee0  c09957cc 00000000 00000000 00000000 00000000 00000000 00
000000 00000000
<4>[  626.359283] 2f00  c09957dc 00000000 00000000 00000000 00000000 00000000 00
000000 00000000
<4>[  626.359346] 2f20  c09957f4 00000000 00000000 00000000 00000000 00000000 00
000000 00000000
<4>[  626.359408] 2f40  c0995908 00000000 00000000 00000000 00000000 00000000 00
000000 00000000
<4>[  626.359471] 2f60  c09958f4 00000000 00000000 00000000 00000000 00000000 00
000000 00000000
<4>[  626.359533] 2f80  c0995834 00000000 00000000 00000000 00000000 00000000 00
000000 00000000
<4>[  626.359596] 2fa0  c0995924 00000000 00000000 00000000 00000000 00000000 00
000000 00000000
<4>[  626.359660]
<4>[  626.359666] R10: 0xc0b12600:
<4>[  626.359679] 2600  c0995234 00000000 00000000 00000000 00000000 00000000 00
000000 00000000
<4>[  626.359742] 2620  c0995254 00000000 00000000 00000000 00000000 00000000 00
000000 00000000
<4>[  626.359805] 2640  c099526c 00000000 00000000 00000000 00000000 00000000 00
000000 00000000
<4>[  626.359868] 2660  c0995244 00000000 00000000 00000000 00000000 00000000 00
000000 00000000
<4>[  626.359930] 2680  c09952a8 00000000 00000000 00000000 00000000 00000000 00
000000 00000000
<4>[  626.359993] 26a0  c0995294 00000000 00000000 00000000 00000000 00000000 00
000000 00000000
<4>[  626.360056] 26c0  c0995280 00000000 00000000 00000000 00000000 00000000 00
000000 00000000
<4>[  626.360118] 26e0  c09952bc 00000000 00000000 00000000 00000000 00000000 00
000000 00000000
<0>[  626.360187] Process kcryptd_io (pid: 3469, stack limit = 0xea3fc2f0)
<0>[  626.360211] Stack: (0xea3fdd98 to 0xea3fe000)
<0>[  626.360233] dd80:                                                       f0
356300 f15e34e0
<0>[  626.360267] dda0: 0000000c c0b12f40 00000000 0b300010 ea3fdddc ea3fddc0 c0
4feb60 c040e260
<0>[  626.360301] ddc0: f0356e80 00000000 f16a1188 c0b12f40 ea3fddf4 ea3fdde0 c0
4fec70 c04feaf8
<0>[  626.360334] dde0: f16a1188 00000001 ea3fde34 ea3fddf8 c0501e0c c04feba8 ea
3fde54 f0356300
<0>[  626.360368] de00: c0382f30 f15e34e0 c0a7b9c8 f0356300 ea3fc000 f16a1188 c0
b12f40 00000000
<0>[  626.360402] de20: 0b300010 c0b12680 ea3fdef4 ea3fde38 c04ffca8 c0501a10 ea
3fc000 c0b68730
<0>[  626.360436] de40: ea3fdeec ea3fde50 c03f5bfc c051f674 c0383218 c0382d2c 00
000000 00000000
<0>[  626.360469] de60: f16a1188 c03af294 f0356e80 00000000 fffffffe 00000000 ea
3fdeac ea3fde88
<0>[  626.360502] de80: c03af294 c035f228 ea3fdeac ea3fde98 c035f228 c035f180 00
000000 0076f2d4
<0>[  626.360536] dea0: ea3fdebc ea3fdeb0 c03b0300 c035f218 ea3fded4 ea3fdec0 c0
37cffc c03bea00
<0>[  626.360569] dec0: 00000000 00000000 ea3fdeec ead41f00 ea3fc000 ead41f08 f1
6b8000 ead41f0c
<0>[  626.360603] dee0: c06764c4 c0b12680 ea3fdf54 ea3fdef8 c0676950 c04ff78c 00
000001 ea3fc000
<0>[  626.360637] df00: ea3fdf24 ea3fdf10 ea3fdf2c ea3fdf18 c037cffc c03bea00 00
000001 ea3fc000
<0>[  626.360671] df20: ea3fdf44 ea3fdf30 c037e1c0 fafe5320 ea3fc000 ead41f08 f1
6b8000 ead41f0c
<0>[  626.360704] df40: c06764c4 c0b12680 ea3fdfac ea3fdf58 c03a49ec c06764d0 60
000013 c03a91a8
<0>[  626.360738] df60: fafe533c fafe5334 c0347964 00000000 f16b8000 c03a96c0 ea
3fdf78 ea3fdf78
<0>[  626.360771] df80: c03783b8 ead39cf4 ea3fdfb8 c03a47ec fafe5320 00000000 00
000000 00000000
<0>[  626.360804] dfa0: ea3fdff4 ea3fdfb0 c03a91c0 c03a47f8 00000000 00000000 00
000000 00000000
<0>[  626.360838] dfc0: ead39cf4 dead4ead ffffffff ffffffff ea3fdfd0 ea3fdfd0 ea
d39cf4 c03a9134
<0>[  626.360871] dfe0: c034d4ec 00000013 00000000 ea3fdff8 c034d4ec c03a9140 3f
3b3f3


Thanks

Fan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-26 21:50               ` Mikulas Patocka
@ 2012-01-27 12:19                 ` Ric Wheeler
  0 siblings, 0 replies; 23+ messages in thread
From: Ric Wheeler @ 2012-01-27 12:19 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Christoph Hellwig, Andrew Morton, Niels de Vos, linux-fsdevel,
	linux-kernel, Al Viro, Jeff Moyer, Bryn M. Reeves

On 01/26/2012 04:50 PM, Mikulas Patocka wrote:
>
> On Thu, 26 Jan 2012, Christoph Hellwig wrote:
>
>> On Thu, Jan 26, 2012 at 01:40:51PM -0800, Andrew Morton wrote:
>>> The Right Thing To Do here is to make the kernel behave logically and
>>> predictably, then modify the userspace tools.  But if we're modifying
>>> the userspace tools then we would just change userspace to issue a
>>> BLKFLSBUF to /dev/sda and leave the kernel alone.
>> The right fix is to make partition and whole disk access coherent,
>> which is fairly simply:
>>
>>   - create the block device inode/mapping per gendisk, and only reference
>>     count it per block_device
>>   - make sure blkdev_get_block(s) applies the correct offset if used on
>>     partitions
> ... and what if you use 4kB blocksize and the partition start is not
> aligned on 4kB? (quite common case, because partitions are often aligned
> on 63 sectors) Then, you can't translate partition block numbers into disk
> block numbers.
>
> Mikulas
>

In your specific example, you cannot start a partition on a non-full block 
address since you cannot address it. Either the device uses 512 byte sectors, 
emulates 512 byte sectors or supports only 4096 byte sectors. In the first two 
case, not an issue.

In the last case, still not an issue. If you send an unaligned IO down to a 4096 
byte device, you get an IO error.

Also note that all (most?) user space tools have been fixed upstream (and in 
distros) to align properly when the device exports its block size. The new 
default alignment is 1MB from the start of disk, this should not be a concern.

Thanks!

Ric



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-26 21:45             ` Christoph Hellwig
  2012-01-26 21:50               ` Mikulas Patocka
@ 2012-01-31 16:00               ` Niels de Vos
  2012-01-31 18:58                 ` Andrew Morton
  1 sibling, 1 reply; 23+ messages in thread
From: Niels de Vos @ 2012-01-31 16:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, linux-fsdevel, linux-kernel, Al Viro,
	Mikulas Patocka, Jeff Moyer, Bryn M. Reeves

On 01/26/2012 09:45 PM, Christoph Hellwig wrote:
> On Thu, Jan 26, 2012 at 01:40:51PM -0800, Andrew Morton wrote:
>> The Right Thing To Do here is to make the kernel behave logically and
>> predictably, then modify the userspace tools.  But if we're modifying
>> the userspace tools then we would just change userspace to issue a
>> BLKFLSBUF to /dev/sda and leave the kernel alone.
> 
> The right fix is to make partition and whole disk access coherent,
> which is fairly simply:
> 
>  - create the block device inode/mapping per gendisk, and only reference
>    count it per block_device
>  - make sure blkdev_get_block(s) applies the correct offset if used on
>    partitions
> 

This surely looks like a better way to fix this issue. I am not sure yet
how much work that would involve and if I am the right person to fix
this. If nobody beats me to it, I might send a patch for review some
(undefined) time later.

Thanks for the suggestions,
Niels

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-31 16:00               ` Niels de Vos
@ 2012-01-31 18:58                 ` Andrew Morton
  2012-01-31 19:04                   ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2012-01-31 18:58 UTC (permalink / raw)
  To: Niels de Vos
  Cc: Christoph Hellwig, linux-fsdevel, linux-kernel, Al Viro,
	Mikulas Patocka, Jeff Moyer, Bryn M. Reeves

On Tue, 31 Jan 2012 16:00:44 +0000
Niels de Vos <ndevos@redhat.com> wrote:

> On 01/26/2012 09:45 PM, Christoph Hellwig wrote:
> > On Thu, Jan 26, 2012 at 01:40:51PM -0800, Andrew Morton wrote:
> >> The Right Thing To Do here is to make the kernel behave logically and
> >> predictably, then modify the userspace tools.  But if we're modifying
> >> the userspace tools then we would just change userspace to issue a
> >> BLKFLSBUF to /dev/sda and leave the kernel alone.
> > 
> > The right fix is to make partition and whole disk access coherent,
> > which is fairly simply:
> > 
> >  - create the block device inode/mapping per gendisk, and only reference
> >    count it per block_device
> >  - make sure blkdev_get_block(s) applies the correct offset if used on
> >    partitions
> > 
> 
> This surely looks like a better way to fix this issue. I am not sure yet
> how much work that would involve and if I am the right person to fix
> this. If nobody beats me to it, I might send a patch for review some
> (undefined) time later.

One concern I have with the proposal is that it would forever rule out
support of >16T devices on 32-bit machines.

At present with 64-bit sector_t and 32-bit pgoff_t, I think we'd have a
reasonable chance of supporting, say, four 8T partitions on a 32T
device.  But if we were to switch the kernel from using four 4T
address_spaces (sda1-4) over to using a single 32T address_space (sda)
then we can rule it all out.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-31 18:58                 ` Andrew Morton
@ 2012-01-31 19:04                   ` Christoph Hellwig
  2012-01-31 19:32                     ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2012-01-31 19:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Niels de Vos, Christoph Hellwig, linux-fsdevel, linux-kernel,
	Al Viro, Mikulas Patocka, Jeff Moyer, Bryn M. Reeves

On Tue, Jan 31, 2012 at 10:58:24AM -0800, Andrew Morton wrote:
> One concern I have with the proposal is that it would forever rule out
> support of >16T devices on 32-bit machines.
> 
> At present with 64-bit sector_t and 32-bit pgoff_t, I think we'd have a
> reasonable chance of supporting, say, four 8T partitions on a 32T
> device.  But if we were to switch the kernel from using four 4T
> address_spaces (sda1-4) over to using a single 32T address_space (sda)
> then we can rule it all out.

how do you plan to write the partition label in your hypothetic setup
if you can't open the main device?

And even if we solved that and people could create partitions on these
devices but not open the main device, or use large lvm volumes it would
be an absolutely major confusion.

So I really don't think your made up case matters.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-31 19:04                   ` Christoph Hellwig
@ 2012-01-31 19:32                     ` Andrew Morton
  2012-01-31 19:37                       ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2012-01-31 19:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Niels de Vos, linux-fsdevel, linux-kernel, Al Viro,
	Mikulas Patocka, Jeff Moyer, Bryn M. Reeves

On Tue, 31 Jan 2012 14:04:25 -0500
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Jan 31, 2012 at 10:58:24AM -0800, Andrew Morton wrote:
> > One concern I have with the proposal is that it would forever rule out
> > support of >16T devices on 32-bit machines.
> > 
> > At present with 64-bit sector_t and 32-bit pgoff_t, I think we'd have a
> > reasonable chance of supporting, say, four 8T partitions on a 32T
> > device.  But if we were to switch the kernel from using four 4T
> > address_spaces (sda1-4) over to using a single 32T address_space (sda)
> > then we can rule it all out.
> 
> how do you plan to write the partition label in your hypothetic setup
> if you can't open the main device?
> 
> And even if we solved that and people could create partitions on these
> devices but not open the main device, or use large lvm volumes it would
> be an absolutely major confusion.
> 

I didn't say the kernel would support this as-is.

If the partitioning scheme requires writing to the individual
partitions then something would need to be done, such as a simple
offsetting DM driver.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-31 19:32                     ` Andrew Morton
@ 2012-01-31 19:37                       ` Christoph Hellwig
  2012-01-31 19:48                         ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2012-01-31 19:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Niels de Vos, linux-fsdevel, linux-kernel,
	Al Viro, Mikulas Patocka, Jeff Moyer, Bryn M. Reeves

On Tue, Jan 31, 2012 at 11:32:50AM -0800, Andrew Morton wrote:
> I didn't say the kernel would support this as-is.
> 
> If the partitioning scheme requires writing to the individual
> partitions then something would need to be done, such as a simple
> offsetting DM driver.

Writing partition tables requires writing to them main block device.

Seriously - if people want to support block devices nodes > 16TB dealing
with this isn't the problem.  They'll need to find a way to do buffered
I/O without using the pagecache to get it right, at which point
blkdev_get_block in either form will simply go away.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition
  2012-01-31 19:37                       ` Christoph Hellwig
@ 2012-01-31 19:48                         ` Andrew Morton
  0 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2012-01-31 19:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Niels de Vos, linux-fsdevel, linux-kernel, Al Viro,
	Mikulas Patocka, Jeff Moyer, Bryn M. Reeves

On Tue, 31 Jan 2012 14:37:48 -0500
Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, Jan 31, 2012 at 11:32:50AM -0800, Andrew Morton wrote:
> > I didn't say the kernel would support this as-is.
> > 
> > If the partitioning scheme requires writing to the individual
> > partitions then something would need to be done, such as a simple
> > offsetting DM driver.
> 
> Writing partition tables requires writing to them main block device.

This can be done via an offsetting driver.

> Seriously - if people want to support block devices nodes > 16TB dealing
> with this isn't the problem.  They'll need to find a way to do buffered
> I/O without using the pagecache to get it right,

why?  I don't see the problem - supporting /dev/sdaX should be
straightforward.  /dev/sda rarely gets used and would need a bit of
special-case handling.

Please provide all the details as you see them and stop making me email
more questions to you.

> at which point
> blkdev_get_block in either form will simply go away.

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2012-01-31 19:48 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-20  1:58 [PATCH] block: Invalidate the cache for a parent block-device Mikulas Patocka
2012-01-20  9:35 ` Niels de Vos
2012-01-23 10:38   ` [PATCH v2] fs: Invalidate the cache for a parent block-device if fsync() is called for a partition Niels de Vos
2012-01-23 16:27     ` Jeff Moyer
2012-01-23 16:46       ` Niels de Vos
2012-01-23 19:23     ` Mikulas Patocka
2012-01-23 20:04     ` Jeff Moyer
2012-01-26 10:03     ` Andrew Morton
2012-01-26 11:50       ` Niels de Vos
2012-01-26 13:33         ` [PATCH v3] " Niels de Vos
2012-01-26 21:40           ` Andrew Morton
2012-01-26 21:45             ` Christoph Hellwig
2012-01-26 21:50               ` Mikulas Patocka
2012-01-27 12:19                 ` Ric Wheeler
2012-01-31 16:00               ` Niels de Vos
2012-01-31 18:58                 ` Andrew Morton
2012-01-31 19:04                   ` Christoph Hellwig
2012-01-31 19:32                     ` Andrew Morton
2012-01-31 19:37                       ` Christoph Hellwig
2012-01-31 19:48                         ` Andrew Morton
2012-01-26 21:49             ` Jeff Moyer
2012-01-26 22:13             ` Mikulas Patocka
2012-01-26 22:26               ` Kernel Oops report (Android gingerbread) Fan Zhang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).