* O_DIRECT reads appear to be cached on block device partition file?
@ 2010-09-14  3:49 Brett Russ
  2010-09-14  7:36 ` Dave Chinner
  0 siblings, 1 reply; 3+ messages in thread
From: Brett Russ @ 2010-09-14  3:49 UTC
  To: linux-kernel

Running a 2.6.31 kernel on a blade chassis system with multiple blades 
sharing common JBOD storage.  The application intelligently divides the 
drives up among the blades, but one blade in particular is charged with 
monitoring.  As part of this, this one monitoring blade can perform 
reads of a certain 512B sector of all disks in the system.  This sector 
is often written by other blades, and these writes are sync'd to disk.  To 
work around the lack of cache coherency between the distinct blades, I'm 
using O_DIRECT on the monitoring blade such that it always reads from 
the media to get the latest copy of this sector.  The basic steps are:

# grab a 512B aligned buffer (use 4KB to be safe)
posix_memalign(&ptr, getpagesize(), 512B)
open(/dev/sdX3, O_RDONLY|O_DIRECT)
lseek(fd, offset, SEEK_SET)
read(fd, ptr, 512B)

If I run the above on the monitoring blade, then sync an update to the 
sector in question from another blade, then re-run the above code on 
the monitoring blade, believe it or not I appear to be reading stale 
data.  If I use dd with iflag=direct, reading the same sector offset at 
the /dev/sdX3 partition file, I see the same stale data as seen from the 
code above.  If, however, I instead access this sector offset from the 
/dev/sdX device file using the (offset of partition 3 + offset of the 
sector) I see the intended data, which makes me believe some caching 
occurred locally for /dev/sdX3.
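
For anyone wanting to reproduce the check, here is a minimal Python sketch
of the read steps listed above.  It is not the original test program.  It
assumes Linux and Python 3.7 or later (for os.O_DIRECT and os.preadv); the
name read_sector_direct and its flags parameter are illustrative only.  An
anonymous mmap is used because it yields page-aligned memory, which is
enough alignment for a 512B sector read under O_DIRECT.

```python
import mmap
import os

# Linux-only flags; O_DIRECT asks the kernel to bypass the page cache.
DIRECT_FLAGS = os.O_RDONLY | os.O_DIRECT


def read_sector_direct(path, offset, size=512, flags=DIRECT_FLAGS):
    """Read `size` bytes at byte `offset` from `path` via an aligned buffer.

    With O_DIRECT in effect, `offset` and `size` must be multiples of the
    device's logical sector size.  `flags` is a hypothetical knob so the
    function can also be tried on files that reject O_DIRECT.
    """
    buf = mmap.mmap(-1, size)  # anonymous mapping: page-aligned memory
    try:
        fd = os.open(path, flags)
        try:
            # Positioned read into the aligned buffer; replaces lseek + read.
            nread = os.preadv(fd, [buf], offset)
            return buf[:nread]  # copy out as bytes before the buffer closes
        finally:
            os.close(fd)
    finally:
        buf.close()
```

A call would look like read_sector_direct("/dev/sdX3", offset), using the
same placeholder device name and offset as in the steps above.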

I've searched Google for too long, coming up empty.  I've tried changing 
the open flags to O_RDWR|O_DIRECT in case the kernel was doing something 
special because it believed there to be no writing to this fd.  Perhaps 
there is an issue with reading only 512B rather than the 1KB Linux block 
size or 4KB page size?

Am I missing something about O_DIRECT, especially as it pertains to a 
device (and a partition on that device) file?  I thought NO caching 
would be involved.

Any clues appreciated,
thanks,
Brett



* Re: O_DIRECT reads appear to be cached on block device partition file?
  2010-09-14  3:49 O_DIRECT reads appear to be cached on block device partition file? Brett Russ
@ 2010-09-14  7:36 ` Dave Chinner
  2010-09-17 22:22   ` Brett Russ
  0 siblings, 1 reply; 3+ messages in thread
From: Dave Chinner @ 2010-09-14  7:36 UTC
  To: Brett Russ; +Cc: linux-kernel

On Mon, Sep 13, 2010 at 11:49:32PM -0400, Brett Russ wrote:
> Running a 2.6.31 kernel on a blade chassis system with multiple
> blades sharing common JBOD storage.  The application intelligently
> divides the drives up among the blades, but one blade in particular
> is charged with monitoring.  As part of this, this one monitoring
> blade can perform reads of a certain 512B sector of all disks in the
> system.  This sector is often written by other blades, these writes
> are sync'd to disk.  To work around the lack of cache coherency
> between the distinct blades, I'm using O_DIRECT on the monitoring
> blade such that it always reads from the media to get the latest
> copy of this sector.  The basic steps are:
> 
> # grab a 512B aligned buffer (use 4KB to be safe)
> posix_memalign(&ptr, getpagesize(), 512B)
> open(/dev/sdX3, O_RDONLY|O_DIRECT)
> lseek(fd, offset, SEEK_SET)
> read(fd, ptr, 512B)
> 
> If I run the above on the monitoring blade, then sync an update to
> the sector in question from another blade, then re-run the above
> code on the monitoring blade, believe it or not I appear to be
> reading stale data.  If I use dd with iflag=direct, reading the same
> sector offset at the /dev/sdX3 partition file, I see the same stale
> data as seen from the code above.  If, however, I instead access
> this sector offset from the /dev/sdX device file using the (offset
> of partition 3 + offset of the sector) I see the intended data,
> which makes me believe some caching occurred locally for /dev/sdX3.

What does blktrace tell you?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: O_DIRECT reads appear to be cached on block device partition file?
  2010-09-14  7:36 ` Dave Chinner
@ 2010-09-17 22:22   ` Brett Russ
  0 siblings, 0 replies; 3+ messages in thread
From: Brett Russ @ 2010-09-17 22:22 UTC
  To: linux-kernel

Dave Chinner wrote:
> On Mon, Sep 13, 2010 at 11:49:32PM -0400, Brett Russ wrote:
>> If I run the above on the monitoring blade, then sync an update to
>> the sector in question from another blade, then re-run the above
>> code on the monitoring blade, believe it or not I appear to be
>> reading stale data.  If I use dd with iflag=direct, reading the same
>> sector offset at the /dev/sdX3 partition file, I see the same stale
>> data as seen from the code above.  If, however, I instead access
>> this sector offset from the /dev/sdX device file using the (offset
>> of partition 3 + offset of the sector) I see the intended data,
>> which makes me believe some caching occurred locally for /dev/sdX3.
>
> What does blktrace tell you?

Thanks Dave for the pointer to blktrace.  I'd not used this before.

The short answer is that I now trust O_DIRECT.  What sent me down this 
path to begin with was a stale cache in our application.

The longer answer of how my dd double-check could have gone wrong follows:

I've discovered that the start-of-partition LBA does not *always* agree 
between the kernel (reported by blktrace and sysfs) and utilities such 
as {fdisk|sfdisk}.  This means that my experiment of accessing the 
sector within the partition via the parent device may have been invalid, 
since I was trusting fdisk to determine the correct sector offset of the 
partition.

> spu0103# fdisk -l -u /dev/sdbk
...
> Units = sectors of 1 * 512 = 512 bytes
>
>     Device Boot      Start         End      Blocks  Id System
...
> /dev/sdbk3      1197742140  1944780704   373519282+ 83 Linux

> spu0103# sfdisk -uS -l /dev/sdbk
...
> Units = sectors of 512 bytes, counting from 0
>
>    Device Boot    Start       End   #sectors  Id  System
...
> /dev/sdbk3     1197742140 1944780704  747038565  83  Linux

> spu0103# cat /sys/block/sdbk/sdbk3/start
> 1197934920
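
The comparison above could be automated along the following lines.  This
is a rough sketch, not tooling that was used in this thread: it reads the
kernel's view of the partition start from sysfs (in 512-byte sectors) and
reports how far it is from a start sector obtained elsewhere, such as from
fdisk.  The function names and the sysfs_root parameter are hypothetical;
the latter exists only so the code can be pointed at a test directory.

```python
import os


def sysfs_partition_start(disk, part, sysfs_root="/sys/block"):
    """Return the partition start sector as the kernel reports it."""
    with open(os.path.join(sysfs_root, disk, part, "start")) as f:
        return int(f.read().strip())


def start_mismatch(kernel_start, table_start):
    """Return kernel_start - table_start in sectors (0 means they agree)."""
    return kernel_start - table_start
```

With the sdbk3 values shown above, start_mismatch(1197934920, 1197742140)
gives 192780 sectors, so a read through the parent device at the fdisk
offset lands that many sectors before the real start of the partition.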

The above discrepancy was also shown with blktrace:

> spu0103# blkparse -q 1
> Input file 1.blktrace.5 added
> Input file 1.blktrace.6 added
> Input file 1.blktrace.7 added
>

This command:

> spu0103# dd-7.1  if=/dev/sdbk3 bs=512 count=1 iflag=direct |hexdump -C

Produced this trace:

>  67,224  5        1     0.000000000 29726  A   R 1197934920 + 1 <- (67,227) 0

Note the kernel remapped the access to sdbk3 (offset 0) to sdbk (offset
1197934920) (see the major:minor numbers listed after the trace), which
is quite different from the partition start shown in fdisk of 1197742140.
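
To illustrate how that remap record can be read programmatically, here is
a small sketch that extracts the remapped sector and the originating
device from an 'A' line.  The regular expression is an assumption fitted
to the trace lines in this message, not a complete parser for blkparse
output, and parse_remap is a hypothetical name.

```python
import re

REMAP_RE = re.compile(
    r"^\s*>?\s*(?P<maj>\d+),(?P<min>\d+)\s+"  # device the event was logged on
    r"\d+\s+\d+\s+[\d.]+\s+\d+\s+"            # cpu, sequence, timestamp, pid
    r"A\s+\S+\s+"                             # action 'A' (remap), then RWBS
    r"(?P<sector>\d+)\s*\+\s*(?P<count>\d+)\s*<-\s*"
    r"\((?P<src_maj>\d+),(?P<src_min>\d+)\)\s+(?P<src_sector>\d+)"
)


def parse_remap(line):
    """Return the remap fields as ints, or None if `line` is not a remap."""
    m = REMAP_RE.match(line)
    if m is None:
        return None
    fields = {name: int(value) for name, value in m.groupdict().items()}
    # Partition start on the parent = remapped sector - sector in partition.
    fields["partition_start"] = fields["sector"] - fields["src_sector"]
    return fields
```

Applied to the remap line above, this reports sector 1197934920 on device
67,224, originating from sector 0 of device 67,227, which is the partition
start as the kernel sees it.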

>  67,224  5        2     0.000000564 29726  Q   R 1197934920 + 1 [dd-7.1]
>  67,224  5        3     0.000004032 29726  G   R 1197934920 + 1 [dd-7.1]
>  67,224  5        4     0.000006223 29726  P   N [dd-7.1]
>  67,224  5        5     0.000008152 29726  I   R 1197934920 + 1 [dd-7.1]
>  67,224  5        6     0.000009916 29726  U   N [dd-7.1] 1
>  67,224  5        7     0.000012286 29726  D   R 1197934920 + 1 [dd-7.1]
>  67,224  7        1     0.006802504     0  C   R 1197934920 + 1 [0]

And this command (accessing the start of partition using fdisk sector 
offset):

> spu0103# dd-7.1  if=/dev/sdbk skip=1197742140 bs=512 count=1 iflag=direct |hexdump -C

Produced this trace (as expected):

>  67,224  7        2    75.330506824 29924  Q   R 1197742140 + 1 [dd-7.1]
>  67,224  7        3    75.330509804 29924  G   R 1197742140 + 1 [dd-7.1]
>  67,224  7        4    75.330511985 29924  P   N [dd-7.1]
>  67,224  7        5    75.330513836 29924  I   R 1197742140 + 1 [dd-7.1]
>  67,224  7        6    75.330515495 29924  U   N [dd-7.1] 1
>  67,224  7        7    75.330517901 29924  D   R 1197742140 + 1 [dd-7.1]
>  67,224  6        1    75.340722638     0  C   R 1197742140 + 1 [0]

The aforementioned major/minor numbers:

> spu0103# ls -l /dev/|grep 67|grep '22[47]'
> brw-rw-rw-    1 root     root      67, 224 Sep 15 11:59 sdbk
> brw-rw-rw-    1 root     root      67, 227 Sep 15 11:59 sdbk3
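
The same major/minor lookup can be done without grepping ls output.  A
minimal sketch using os.stat follows; device_numbers is a hypothetical
helper name, and the code assumes a Unix-like system where device nodes
carry their numbers in st_rdev.

```python
import os
import stat


def device_numbers(path):
    """Return (major, minor) for a block or character device node."""
    st = os.stat(path)
    if not (stat.S_ISBLK(st.st_mode) or stat.S_ISCHR(st.st_mode)):
        raise ValueError(f"{path} is not a device node")
    return os.major(st.st_rdev), os.minor(st.st_rdev)
```

On the system above, device_numbers("/dev/sdbk3") would be expected to
match the 67, 227 pair that appears in the remap line of the trace.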

*All* other drives in my system that I tested do show a match between 
the 3 methods above (fdisk, sfdisk, sysfs).

I don't know how this discrepancy with the partition start could have 
been introduced, but it is most likely a byproduct of my testing.

Thanks,
Brett

