* xfs: invalid requests to request_fn from xfs_repair
@ 2014-04-01 20:16 Jamie Pocas
  2014-04-01 20:42 ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Jamie Pocas @ 2014-04-01 20:16 UTC (permalink / raw)
  To: xfs


Hi folks,

I have a very simple block device driver that uses the request_fn style of
processing instead of the older bio handling or newer multiqueue approach.
I have been using this with ext3 and ext4 for years with no issues, but
scalability requirements have dictated that I move to xfs to better support
larger devices.

I'm observing something weird in my request_fn. It seems like the block
layer is issuing invalid requests to my request function, and it really
manifests when I use xfs_repair. Here's some info:

blk_queue_physical_block_size(q, 512);  /* should be no surprise */
blk_queue_logical_block_size(q, 512);   /* should be no surprise */
blk_queue_max_segments(q, 128);         /* 128 memory segments (page + offset/length pairs) per request! */
blk_queue_max_hw_sectors(q, CA_MAX_REQUEST_SECTORS);   /* up to 1024 sectors (512k) per request, a hard limit in the kernel */
blk_queue_max_segment_size(q, CA_MAX_REQUEST_BYTES);   /* 512k (1024 sectors) is the hard limit in the kernel */
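
For completeness, those limits get wired up in my init path roughly like
this (a sketch rather than the actual driver code; everything other than
the blk_queue_* calls is simplified or a placeholder):

#define CA_MAX_REQUEST_SECTORS	1024			/* 512k in 512-byte sectors */
#define CA_MAX_REQUEST_BYTES	(CA_MAX_REQUEST_SECTORS << 9)

static struct request_queue *ca_setup_queue(void)
{
	struct request_queue *q = blk_init_queue(ca_request_fn, &ca_lock);

	if (!q)
		return NULL;
	blk_queue_physical_block_size(q, 512);
	blk_queue_logical_block_size(q, 512);
	blk_queue_max_segments(q, 128);
	blk_queue_max_hw_sectors(q, CA_MAX_REQUEST_SECTORS);
	blk_queue_max_segment_size(q, CA_MAX_REQUEST_BYTES);
	return q;
}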

While iterating through segments in rq_for_each_segment(), for some
requests I am seeing some odd behavior.

segment 0: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 903   // Ok, this looks normal
segment 1: iter.bio->bi_sector = 1023, blk_rq_cur_sectors(rq) = 7  // Whoah... this doesn't look right to me

You can see with segment 1 that the start sector is *NOT* adjacent to the
previous segment's sectors (there's a gap from sector 903 through 1022)
and that the "sparse" request, for lack of a better term, extends beyond
the max I/O boundary of 512k. Furthermore, this doesn't seem to jibe with
what userspace is doing, which is a simple 512k read all in one chunk with
a single userspace address.

But when you look at the strace of what xfs_repair is doing, it's just an
innocuous read of 512k from sector 0.

write(2, "Phase 1 - find and verify superb"..., 40Phase 1 - find and verify
superblock...
) = 40
mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
= 0x7f00e2f42000
mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
= 0x7f00e2ec1000
lseek(4, 0, SEEK_SET)                   = 0
read(4, 0x7f00e2ec1200, 524288)         = -1 EIO (Input/output error)
write(2, "superblock read failed, offset 0"..., 61superblock read failed,
offset 0, size 524288, ag 0, rval -1
) = 61

The reason you see the EIO is because I am failing a request in the driver
since it violates the restrictions I set earlier, is non-adjacent, and so I
am unable to satisfy it.

*Point 1:* Shouldn't requests contain only segments that are adjacent on
disk? E.g. if, before the rq_for_each_segment() loop, blk_rq_pos(rq) is 10
and blk_rq_cur_sectors is 10, then on the next iteration (if any)
iter.bio->bi_sector should be 10+10=20? Is my understanding correct? Or are
these some kind of special requests that should be handled differently?
(E.g. I know that DISCARD requests have to be handled differently and
shouldn't be run through rq_for_each_segment, and that FLUSH requests are
often empty.) The cmd_flags say that they are normal REQ_TYPE_FS requests.
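
(For reference, this is roughly how my request_fn dispatches those cases
today; heavily simplified, and ca_do_rw()/ca_do_discard()/ca_do_flush() are
stand-ins for the real transfer code, which is more involved:)

static void ca_request_fn(struct request_queue *q)
{
	struct request *rq;

	/* Called with the queue lock held. */
	while ((rq = blk_fetch_request(q)) != NULL) {
		if (rq->cmd_type != REQ_TYPE_FS) {
			__blk_end_request_all(rq, -EIO);	/* not a normal fs request */
			continue;
		}
		if (rq->cmd_flags & REQ_DISCARD) {
			/* No data payload; don't walk segments. */
			__blk_end_request_all(rq, ca_do_discard(rq));
			continue;
		}
		if (rq->cmd_flags & REQ_FLUSH) {
			/* Usually empty; just sync the backing store. */
			__blk_end_request_all(rq, ca_do_flush(rq));
			continue;
		}
		__blk_end_request_all(rq, ca_do_rw(rq));	/* normal read/write */
	}
}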

*Point 2:* If I ignore the incorrect iter.bio->bi_sector and just
read/write the request out as if it were adjacent, xfs_repair reports
corruption, and sure enough there are inodes which are zeroed out instead
of having the inode magic 0x494e ( "IN") as expected. So mkfs.xfs, while
not sending what appear to be illegal requests, is still resulting in
corruption.

*Point 3:* Interestingly, this goes away when I set
blk_queue_max_segments(q, 1), but that obviously cuts down on clustering
and kills performance. Is this indicative of anything in particular that I
could be doing wrong?

Please cut me some slack when I say something like xfs_repair is "sending"
invalid requests. I know that there is the C library, system call
interface, block layer, etc.. in between, but I just mean to say simply
that using this tool results in this unexpected behavior. I don't mean to
point blame at xfs or xfsprogs. If this turns out to be a block layer
issue, and this posting needs to be sent elsewhere, I apologize and would
appreciate being pointed in the right direction.

It almost feels like the block layer is splitting the bios up wrongly, is
corrupting the bvecs, or introducing a race. What's strange, again, is
that I have only seen this behavior with the xfs tools, not with ext3/ext4
and e2fsprogs, which have been working for years. It really shouldn't matter
though, because mkfs.xfs and xfs_repair are user space tools, so this
shouldn't cause the block layer in the kernel to send down invalid
requests. I have been grappling with this for a few weeks, and I am tempted
to go to the old bio handling function instead just to see if that would
work out for me better, but that would be a big rewrite of the LLD. I am
using an older Ubuntu 12.04 kernel 3.2.x so I am not able to go to the
newer multiqueue implementation.
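
For what it's worth, the rewrite I'm contemplating is essentially the
difference between these two registration styles (3.2-era API, sketch only,
with placeholder names):

/* Current approach: the block layer queues and merges bios into struct
 * requests and calls my request_fn. */
q = blk_init_queue(ca_request_fn, &ca_lock);

/* Alternative: take each bio directly and do any merging myself. */
q = blk_alloc_queue(GFP_KERNEL);
blk_queue_make_request(q, ca_make_request);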


Any ideas/suggestions?
Need more information?


Thanks and Regards,
Jamie Pocas


* Re: xfs: invalid requests to request_fn from xfs_repair
  2014-04-01 20:16 xfs: invalid requests to request_fn from xfs_repair Jamie Pocas
@ 2014-04-01 20:42 ` Dave Chinner
  2014-04-01 22:28   ` Jamie Pocas
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2014-04-01 20:42 UTC (permalink / raw)
  To: Jamie Pocas; +Cc: xfs

On Tue, Apr 01, 2014 at 04:16:39PM -0400, Jamie Pocas wrote:
> Hi folks,
> 
> I have a very simple block device driver that uses the request_fn style of
> processing instead of the older bio handling or newer multiqueue approach.
> I have been using this with ext3 and ext4 for years with no issues, but
> scalability requirements have dictated that I move to xfs to better support
> larger devices.
> 
> I'm observing something weird in my request_fn. It seems like the block
> layer is issuing invalid requests to my request function, and it really
> manifests when I use xfs_repair. Here's some info:
> 
> blk_queue_physical_block_size(q, 512) // should be no surprise
> blk_queue_logical_block_size(q, 512) // should be no surprise

512 byte sectors.

> blk_queue_max_segments(q, 128); /* 128 memory segments (page +
> offset/length pairs) per request! */
> blk_queue_max_hw_sectors(q, CA_MAX_REQUEST_SECTORS); /* Up to 1024 sectors
> (512k) per request hard limit in the kernel */
> blk_queue_max_segment_size(q, CA_MAX_REQUEST_BYTES); /* 512k (1024 sectors)
> is the hard limit in the kernel */

And up to 512KB per IO.

> While iterating through segments in rq_for_each_segment(), for some
> requests I am seeing some odd behavior.
> 
> segment 0: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 903   // Ok,
> this looks normal
> segment 1: iter.bio->bi_sector = 1023, blk_rq_cur_sectors(rq) = 7 //
> Whoah... this doesn't look right to me

Seems fine to me. There's absolutely no reason two separate IOs
can't be sub-page sector aligned or discontiguous given the above
configuration. If that's what the getblocks callback returned to the
DIO layer, then that's what you're going to see in the bios...

> You can see with segment 1, that the start sector is *NOT*
> adjacent to the the previous segment's sectors (there's a gap from
> sector 903 through 1022) and that the "sparse" request, for lack
> of a better term, extends beyond the max I/O boundary of 512k.
> Furthermore, this doesn't seem to jibe with what userspace is
> doing, which is a simple 512k read all in one chunk with a single
> userspace address.

The read syscall is for a byte offset (from the fd, set by lseek)
and a length, not a range of contiguous sectors on the device. That
off/len tuple gets mapped by the underlying filesystem or device
into a sector/len via a getblocks callback in the dio code and the
bios are then built according to the mappings that are returned. So
in many cases the IO that hits the block device looks nothing at all
like the IO that came from userspace.
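
(For reference, the mapping callback has roughly this shape, per
include/linux/fs.h, give or take the kernel version; the fs or block device
fills bh_result with the device and starting block for as much of the range
as it can map contiguously:)

typedef int (get_block_t)(struct inode *inode, sector_t iblock,
                          struct buffer_head *bh_result, int create);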


> But when you look at the strace of what xfs_repair is doing, it's just an
> innocuous read of 512k from sector 0.
> 
> write(2, "Phase 1 - find and verify superb"..., 40Phase 1 - find and verify
> superblock...
> ) = 40
> mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
> = 0x7f00e2f42000
> mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
> = 0x7f00e2ec1000
> lseek(4, 0, SEEK_SET)                   = 0
> read(4, 0x7f00e2ec1200, 524288)         = -1 EIO (Input/output error)
> write(2, "superblock read failed, offset 0"..., 61superblock read failed,
> offset 0, size 524288, ag 0, rval -1
> ) = 61

This is mostly meaningless without the command line you used for
xfs_repair and the trace from the open() syscall parameters that
returned fd 4 because we have no idea what the IO context actually
is.

> The reason you see the EIO is because I am failing a request in the driver
> since it violates the restrictions I set earlier, is non-adjacent, and so I
> am unable to satisfy it.
> 
> *Point 1:* Shouldn't requests contain all segments that are adjacent on
> disk

Not necessarily, see above.

> *Point 2:* If I ignore the incorrect iter.bio->bi_sector, and just
> read/write the request out as if it were adjacent, I xfs_repair reports
> corruption,

Of course, because you read data from different sectors than was
asked for by the higher layers.

> and sure enough there are inodes which are zeroed out instead
> of having the inode magic 0x494e ( "IN") as expected. So mkfs.xfs, while
> not sending what appear to be illegal requests, is still resulting in
> corruption.
> 
> *Point 3:* Interestingly this goes away when I set
> blk_queue_max_segments(q, 1), but this obviously cuts down on clustering,
> and this of course kills performance. Is this indicative of anything in
> particular that I could be doing wrong?

Probably does, but I can't tell you what it may be...

> Please cut me some slack when I say something like xfs_repair is "sending"
> invalid requests. I know that there is the C library, system call
> interface, block layer, etc.. in between, but I just mean to say simply
> that using this tool results in this unexpected behavior. I don't mean to
> point blame at xfs or xfsprogs. If this turns out to be a block layer
> issue, and this posting needs to be sent elsewhere, I apologize and would
> appreciate being pointed in the right direction.
> 
> It almost feels like the block layer is splitting the bios up wrongly, is
> corrupting the bvecs, or is introducing a race. What's strange again, is
> that I have only seen this behavior with xfs tools, but not ext3, or ext4
> and e2fsprogs which has been working for years. It really shouldn't matter

Because the XFS tools use direct IO, and the ext tools don't.
Therefore the IO from the different tools passes through completely
different code paths in the kernel, with different constraints.
e.g. buffered IO will always be page aligned, direct IO can be
sector aligned....
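
(As a rough illustration of the difference, and not actual xfsprogs code, a
direct read like the one in your strace boils down to something like this;
note the buffer only needs sector alignment, not page alignment:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch only: read the first 512k of a block device via O_DIRECT. */
static int read_first_512k(const char *dev)
{
	void *buf;
	ssize_t n;
	int fd = open(dev, O_RDONLY | O_DIRECT);

	if (fd < 0)
		return -1;
	if (posix_memalign(&buf, 512, 512 * 1024)) {	/* sector-aligned buffer */
		close(fd);
		return -1;
	}
	n = read(fd, buf, 512 * 1024);			/* 512k read, as in the strace */
	free(buf);
	close(fd);
	return n == 512 * 1024 ? 0 : -1;
}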

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs: invalid requests to request_fn from xfs_repair
  2014-04-01 20:42 ` Dave Chinner
@ 2014-04-01 22:28   ` Jamie Pocas
  2014-04-02  4:47     ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Jamie Pocas @ 2014-04-01 22:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs


I have to say, thanks very much for the lightning fast response. Comments
are inline. I think the punchline is going to end up being that I probably
need to learn some more about the effects of O_DIRECT on the I/O path and
how it ends up in the request_queue. It's tempting now to regress back to a
bio-handling function, do my own merging, and see if I can avoid the
problem, but I am stubborn so I will probably stick it out for as long as I
can.

On Tue, Apr 1, 2014 at 4:42 PM, Dave Chinner <david@fromorbit.com> wrote:

> On Tue, Apr 01, 2014 at 04:16:39PM -0400, Jamie Pocas wrote:
> > blk_queue_physical_block_size(q, 512) // should be no surprise
> > blk_queue_logical_block_size(q, 512) // should be no surprise
>
> 512 byte sectors.
>
> > blk_queue_max_segments(q, 128); /* 128 memory segments (page +
> > offset/length pairs) per request! */
> > blk_queue_max_hw_sectors(q, CA_MAX_REQUEST_SECTORS); /* Up to 1024
> sectors
> > (512k) per request hard limit in the kernel */
> > blk_queue_max_segment_size(q, CA_MAX_REQUEST_BYTES); /* 512k (1024
> sectors)
> > is the hard limit in the kernel */
>
> And up to 512KB per IO.
>
> > While iterating through segments in rq_for_each_segment(), for some
> > requests I am seeing some odd behavior.
> >
> > segment 0: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 903   // Ok,
> > this looks normal
> > segment 1: iter.bio->bi_sector = 1023, blk_rq_cur_sectors(rq) = 7 //
> > Whoah... this doesn't look right to me
>
> Seems fine to me. There's absolutely no reason two separate IOs
> can't be sub-page sector aligned or discontiguous given the above
> configuration. If that's what the getblocks callback returned to the
> DIO layer, then that's what you're going to see in the bios...
>
>
I guess my assumption that sectors would be adjacent was invalid, but this
is what I have read in ldd3 and other books and seen in some driver
examples. So I apologize for my folly and it's another lesson in not
believing everything you see on the net. The back half of my driver is
really optimized for transferring a single contiguous LBA & transfer length
pair (a la SCSI WRITE(16), for example). However, in desperation I had changed
the driver to support requests like this (with disjoint sector ranges), but
it still seemed odd to me that occasionally some requests would even have
multiple segments that would *write* the same sectors/lengths. For example
I also see requests like the following.

Given: rq_data_dir(rq) == 1 (i.e. a write)
segment   0: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7    // Ok
segment   1: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7    // Again?
segment   2: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7    // Yet again?
segment   3: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7    // Really?
...
segment 126: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7    // What's going on?
// Reminder: segment limit is 128
segment 127: iter.bio->bi_sector = 1023, blk_rq_cur_sectors(rq) = 1 // Haha, bet you thought I was going to say sector = 0!

This makes sense for reads, since maybe multiple user buffers wanted to
read the same sectors at [around] the same time so it batched them in one
request (temporal + spatial locality), but it doesn't make a lot of sense
to me for writes. That seems inefficient at best. Wouldn't the last write
win?

Even still, I occasionally see requests with more segments than the max
(perhaps I am counting segments wrong too). This was commented on by Jens
Axboe years ago here, but still seems to happen on my Ubuntu 12.04 w/ 3.2
kernel occasionally: https://lkml.org/lkml/2009/2/4/465
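
(If my counting is part of the problem: I believe the 128 limit applies to
physical segments after the block layer merges adjacent bvecs, so counting
bvec iterations can legitimately come out higher. A sketch of how I could
double-check that, with a made-up debugging helper rather than anything from
my actual driver:)

/* Count physical segments the way the block layer does, i.e. after merging
 * physically adjacent bvecs, instead of counting bvec iterations. */
static int ca_count_phys_segments(struct request_queue *q, struct request *rq)
{
	struct scatterlist *sgl;
	int nsegs;

	sgl = kcalloc(128, sizeof(*sgl), GFP_ATOMIC);	/* 128 == max_segments */
	if (!sgl)
		return -ENOMEM;
	sg_init_table(sgl, 128);
	nsegs = blk_rq_map_sg(q, rq, sgl);
	kfree(sgl);
	return nsegs;
}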

> You can see with segment 1, that the start sector is *NOT*
> > adjacent to the the previous segment's sectors (there's a gap from
> > sector 903 through 1022) and that the "sparse" request, for lack
> > of a better term, extends beyond the max I/O boundary of 512k.
> > Furthermore, this doesn't seem to jibe with what userspace is
> > doing, which is a simple 512k read all in one chunk with a single
> > userspace address.
>
> The read syscall is for a byte offset (from the fd, set by lseek)
> and a length, not a range of contiguous sectors on the device. That
> off/len tuple gets mapped by the underlying filesystem or device
> into an sector/len via a getblocks callback in the dio code and the
> bios are then built according to the mappings that are returned. So
> in many cases the IO that hits the block device looks nothing at all
> like the IO that came from userspace.
>
>
In general if I were reading a file I agree 100%, but this was the read
call issued by xfs_repair to read the superblock directly from a block
device. So I, perhaps wrongly, anticipated straightforward requests or
maybe even a single 512k request coming down to the request_fn. I should
have mentioned that this was observed while using xfs_repair earlier in the
email though, so no offense intended. There's more of the xfs_repair strace
in my response to your next comment below. It fails very early as you can
imagine when xfs_repair can't read the superblock.


> > But when you look at the strace of what xfs_repair is doing, it's just an
> > innocuous read of 512k from sector 0.
> >
> > write(2, "Phase 1 - find and verify superb"..., 40Phase 1 - find and
> verify
> > superblock...
> > ) = 40
> > mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> 0)
> > = 0x7f00e2f42000
> > mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> 0)
> > = 0x7f00e2ec1000
> > lseek(4, 0, SEEK_SET)                   = 0
> > read(4, 0x7f00e2ec1200, 524288)         = -1 EIO (Input/output error)
> > write(2, "superblock read failed, offset 0"..., 61superblock read failed,
> > offset 0, size 524288, ag 0, rval -1
> > ) = 61
>
> This is mostly meaningless without the command line you used for
> xfs_repair and the trace from the open() syscall parameters that
> returned fd 4 because we have no idea what the IO context actually
> is.
>
>
Yes, sorry about that; I was trying not to bloat the thread with all the
locale-loading noise. I'm going to strip some of that out and post more of
the strace below. I *do* see the O_DIRECT flag as
expected. Here's the block device being opened. All I did was "xfs_repair
-n /dev/tsca0", hence the O_RDONLY flag.

open("/dev/tsca0", O_RDONLY|O_DIRECT)   = 4
fstat(4, {st_mode=S_IFBLK|0660, st_rdev=makedev(251, 0), ...}) = 0
fstat(4, {st_mode=S_IFBLK|0660, st_rdev=makedev(251, 0), ...}) = 0
ioctl(4, BLKGETSIZE64, 0x7fff8f759270)  = 0
ioctl(4, BLKSSZGET, 0x6732e0)           = 0
chdir("/home/jpocas/source/xfs/drivers") = 0
brk(0x23f9000)                          = 0x23f9000
close(3)                                = 0
getrlimit(RLIMIT_FSIZE, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) =
0
open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=2570, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
0x7f00e2fc2000
read(3, "# Locale name alias data base.\n#"..., 4096) = 2570
read(3, "", 4096)                       = 0
close(3)                                = 0
munmap(0x7f00e2fc2000, 4096)            = 0
write(2, "Phase 1 - find and verify superb"..., 40Phase 1 - find and verify
superblock...
) = 40
mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
= 0x7f00e2f42000
mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
= 0x7f00e2ec1000
lseek(4, 0, SEEK_SET)                   = 0
read(4, 0x7f00e2ec1200, 524288)         = -1 EIO (Input/output error)
write(2, "superblock read failed, offset 0"..., 61superblock read failed,
offset 0, size 524288, ag 0, rval -1
) = 61
write(2, "\nfatal error -- ", 16
fatal error -- )       = 16
write(2, "Input/output error\n", 19Input/output error
)    = 19
exit_group(1)                           = ?



> > The reason you see the EIO is because I am failing a request in the
> driver
> > since it violates the restrictions I set earlier, is non-adjacent, and
> so I
> > am unable to satisfy it.
> >
> > *Point 1:* Shouldn't requests contain all segments that are adjacent on
> > disk
>
> Not necessarily, see above.
>
>
Got it. I need to look at the DIO code path in the kernel.


> > *Point 2:* If I ignore the incorrect iter.bio->bi_sector, and just
> > read/write the request out as if it were adjacent, I xfs_repair reports
> > corruption,
>
> Of course, because you read data from different sectors than was
> asked for by the higher layers.
>
>
Sorry, this was just out of desperation and another data point, ridiculous
as it may sound. I have been looking at this for a couple weeks now so I am
taking wild shots in the dark.


> > and sure enough there are inodes which are zeroed out instead
> > of having the inode magic 0x494e ( "IN") as expected. So mkfs.xfs, while
> > not sending what appear to be illegal requests, is still resulting in
> > corruption.
> >
> > *Point 3:* Interestingly this goes away when I set
> > blk_queue_max_segments(q, 1), but this obviously cuts down on clustering,
> > and this of course kills performance. Is this indicative of anything in
> > particular that I could be doing wrong?
>
> Probably does, but I can't tell you what it may be...
>
> > Please cut me some slack when I say something like xfs_repair is
> "sending"
> > invalid requests. I know that there is the C library, system call
> > interface, block layer, etc.. in between, but I just mean to say simply
> > that using this tool results in this unexpected behavior. I don't mean to
> > point blame at xfs or xfsprogs. If this turns out to be a block layer
> > issue, and this posting needs to be sent elsewhere, I apologize and would
> > appreciate being pointed in the right direction.
> >
> > It almost feels like the block layer is splitting the bios up wrongly, is
> > corrupting the bvecs, or is introducing a race. What's strange again, is
> > that I have only seen this behavior with xfs tools, but not ext3, or ext4
> > and e2fsprogs which has been working for years. It really shouldn't
> matter
>
> Because the XFS tools use direct IO, and the ext tools don't.
> Therefore the IO that the different tools pass through are completely
> different code paths in the kernel that have different constraints.
> e.g. buffered IO will always be page aligned, direct IO can be
> sector aligned....
>
>
That makes sense and I do see that O_DIRECT is being used by both mkfs.xfs
and xfs_repair. I guess I am off to investigate the O_DIRECT code path.


* Re: xfs: invalid requests to request_fn from xfs_repair
  2014-04-01 22:28   ` Jamie Pocas
@ 2014-04-02  4:47     ` Dave Chinner
  2014-04-05 18:19       ` Jamie Pocas
  0 siblings, 1 reply; 6+ messages in thread
From: Dave Chinner @ 2014-04-02  4:47 UTC (permalink / raw)
  To: Jamie Pocas; +Cc: xfs

On Tue, Apr 01, 2014 at 06:28:19PM -0400, Jamie Pocas wrote:
> I have to say, thanks very much for the lightning fast response. Comments
> are inline. I think the punchline is going to end up being that I probably
> need to learn some more about the effects of O_DIRECT on the I/O path and
> how it ends up in the request_queue. It's tempting now to regress back to a
> bio-handling function, do my own merging, and see if I can avoid the
> problem, but I am stubborn so I will probably stick it out for as long as I
> can.
> 
> On Tue, Apr 1, 2014 at 4:42 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Tue, Apr 01, 2014 at 04:16:39PM -0400, Jamie Pocas wrote:
> > > blk_queue_physical_block_size(q, 512) // should be no surprise
> > > blk_queue_logical_block_size(q, 512) // should be no surprise
> >
> > 512 byte sectors.
> >
> > > blk_queue_max_segments(q, 128); /* 128 memory segments (page +
> > > offset/length pairs) per request! */
> > > blk_queue_max_hw_sectors(q, CA_MAX_REQUEST_SECTORS); /* Up to 1024
> > sectors
> > > (512k) per request hard limit in the kernel */
> > > blk_queue_max_segment_size(q, CA_MAX_REQUEST_BYTES); /* 512k (1024
> > sectors)
> > > is the hard limit in the kernel */
> >
> > And up to 512KB per IO.
> >
> > > While iterating through segments in rq_for_each_segment(), for some
> > > requests I am seeing some odd behavior.
> > >
> > > segment 0: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 903   // Ok,
> > > this looks normal
> > > segment 1: iter.bio->bi_sector = 1023, blk_rq_cur_sectors(rq) = 7 //
> > > Whoah... this doesn't look right to me
> >
> > Seems fine to me. There's absolutely no reason two separate IOs
> > can't be sub-page sector aligned or discontiguous given the above
> > configuration. If that's what the getblocks callback returned to the
> > DIO layer, then that's what you're going to see in the bios...
> >
> >
> I guess my assumption that sectors would be adjacent was invalid, but this
> is what I have read in ldd3 and other books and seen in some driver
> examples. So I apologize for my folly and it's another lesson in not
> believing everything you see on the net. The back half of my driver is
> really optimized for transferring a single contiguous lba & transfer length
> pair (ala SCSI WRITE 16 for example). However in desperation, I had changed
> the driver to support requests like this (with disjoint sector ranges) but
> it still seemed odd to me that occasionally some requests would even have
> multiple segments that would *write* the same sectors/lengths. For example
> I also see requests like the following.
> 
> Given: rq_data_dir(rq) == 1 (i.e. a write)
> segment   0: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7    // Ok
> segment   1: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7    //
> Again?
> segment   2: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7    // Yet
> again?
> segment   3: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7    //
> Really?
> ...
> segment 126: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7    //
> What's going on?
> // Reminder: segment limit is 128
> segment 127: iter.bio->bi_sector = 1023, blk_rq_cur_sectors(rq) = 1 //
> Haha, bet you though I was going to say sector = 0!

That really looks to me like a bug in whatever is building that iter
structure. It looks like there's an off-by-one on each page (should
be 8 sectors not 7) and the bi_sector is not being set correctly to
the sector offset of the bio within the IO. i.e. for a contiguous
IO it should look something like:

segment   0: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 8
segment   1: iter.bio->bi_sector = 8, blk_rq_cur_sectors(rq) = 8
segment   2: iter.bio->bi_sector = 16, blk_rq_cur_sectors(rq) = 8
....
segment 127: iter.bio->bi_sector = 1016, blk_rq_cur_sectors(rq) = 8

I highly doubt the direct IO code is doing something wrong, because
a mapping bug like that would show up everywhere and not just in your
special driver....

> > The read syscall is for a byte offset (from the fd, set by lseek)
> > and a length, not a range of contiguous sectors on the device. That
> > off/len tuple gets mapped by the underlying filesystem or device
> > into an sector/len via a getblocks callback in the dio code and the
> > bios are then built according to the mappings that are returned. So
> > in many cases the IO that hits the block device looks nothing at all
> > like the IO that came from userspace.
> >
> >
> In general if I were reading a file I agree 100%, but this was the read
> call issued by xfs_repair to read the superblock directly from a block
> device. So I, perhaps wrongly, anticipated straightforward requests or
> maybe even a single 512k request coming down to the request_fn. I should
> have mentioned that this was observed while using xfs_repair earlier in the
> email though, so no offense intended. There's more of the xfs_repair strace
> in my response to your next comment below. It fails very early as you can
> imagine when xfs_repair can't read the superblock.

I figured that this is what you were talking about, but remember
that the block device also does its own offset->sector mapping in
the direct IO layer. So it's entirely possible that the mapping
function (i.e. the getblocks callback) is broken and that is why you
are seeing weird mappings in the bios being built by the direct IO
code.

> and post more of the strace below. I *do* see the O_DIRECT flag as
> expected. Here's the block device being opened. All I did was "xfs_repair
> -n /dev/tsca0", hence the O_RDONLY flag.

That all looks like what I'd expect from xfs_repair....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: xfs: invalid requests to request_fn from xfs_repair
  2014-04-02  4:47     ` Dave Chinner
@ 2014-04-05 18:19       ` Jamie Pocas
  2014-04-08  3:56         ` Jamie Pocas
  0 siblings, 1 reply; 6+ messages in thread
From: Jamie Pocas @ 2014-04-05 18:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs


On Wed, Apr 2, 2014 at 12:47 AM, Dave Chinner <david@fromorbit.com> wrote:

> > > > While iterating through segments in rq_for_each_segment(), for some
> > > > requests I am seeing some odd behavior.
> > > >
> > > > segment 0: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 903
> // Ok,
> > > > this looks normal
> > > > segment 1: iter.bio->bi_sector = 1023, blk_rq_cur_sectors(rq) = 7 //
> > > > Whoah... this doesn't look right to me
> > >
> > > Seems fine to me. There's absolutely no reason two separate IOs
> > > can't be sub-page sector aligned or discontiguous given the above
> > > configuration. If that's what the getblocks callback returned to the
> > > DIO layer, then that's what you're going to see in the bios...
> > >
> > >
> > I guess my assumption that sectors would be adjacent was invalid, but
> this
> > is what I have read in ldd3 and other books and seen in some driver
> > examples. So I apologize for my folly and it's another lesson in not
> > believing everything you see on the net. The back half of my driver is
> > really optimized for transferring a single contiguous lba & transfer
> length
> > pair (ala SCSI WRITE 16 for example). However in desperation, I had
> changed
> > the driver to support requests like this (with disjoint sector ranges)
> but
> > it still seemed odd to me that occasionally some requests would even have
> > multiple segments that would *write* the same sectors/lengths. For
> example
> > I also see requests like the following.
> >
> > Given: rq_data_dir(rq) == 1 (i.e. a write)
> > segment   0: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7    // Ok
> > segment   1: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7    //
> > Again?
> > segment   2: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7    //
> Yet
> > again?
> > segment   3: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7    //
> > Really?
> > ...
> > segment 126: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 7    //
> > What's going on?
> > // Reminder: segment limit is 128
> > segment 127: iter.bio->bi_sector = 1023, blk_rq_cur_sectors(rq) = 1 //
> > Haha, bet you though I was going to say sector = 0!
>
> That really looks to me like a bug in whatever is building that iter
> structure. It looks like there's an off-by-one on each page (should
> be 8 sectors not 7) and the bi_sector is not being set correctly to
> the sector offset of the bio within the IO. i.e. for a contiguous
> IO it should look something like:
>
> segment   0: iter.bio->bi_sector = 0, blk_rq_cur_sectors(rq) = 8
> segment   1: iter.bio->bi_sector = 8, blk_rq_cur_sectors(rq) = 8
> segment   2: iter.bio->bi_sector = 16, blk_rq_cur_sectors(rq) = 8
> ....
> segment 127: iter.bio->bi_sector = 1016, blk_rq_cur_sectors(rq) = 8
>
> I highly doubt the direct IO code is doing something wrong, because
> a mapping bug like that would show up everywhere and not just in your
> special driver....
>
>
I agree that the iterator *looks* wrong. Like I said earlier, I am just
using the rq_for_each_segment() macro (defined in include/linux/blkdev.h;
it expands to a few dozen lines of code). That macro is the recommended way
of iterating over requests, rather than using the lower-level bio iteration
functions or rolling my own, precisely because it avoids reaching deeper
into the structures and breaking compatibility when the kernel changes. I
suspect it's the request itself that was constructed incorrectly, or that
it contains bios that should not have been grouped into the same request,
or bios that have their bi_sector set incorrectly.


> > > The read syscall is for a byte offset (from the fd, set by lseek)
> > > and a length, not a range of contiguous sectors on the device. That
> > > off/len tuple gets mapped by the underlying filesystem or device
> > > into an sector/len via a getblocks callback in the dio code and the
> > > bios are then built according to the mappings that are returned. So
> > > in many cases the IO that hits the block device looks nothing at all
> > > like the IO that came from userspace.
> > >
> > >
> > In general if I were reading a file I agree 100%, but this was the read
> > call issued by xfs_repair to read the superblock directly from a block
> > device. So I, perhaps wrongly, anticipated straightforward requests or
> > maybe even a single 512k request coming down to the request_fn. I should
> > have mentioned that this was observed while using xfs_repair earlier in
> the
> > email though, so no offense intended. There's more of the xfs_repair
> strace
> > in my response to your next comment below. It fails very early as you can
> > imagine when xfs_repair can't read the superblock.
>
> I figured that this is what you were talking about, but remember
> that the block device also does it's own offset->sector mapping in
> the direct IO layer. So it's entirely possible that the mapping
> function (i.e the getblocks callback) is broken and that is why you
> are seeing weird mappings in the bios being built by the direct Io
> code.
>
>
This is the xfs_repair tool being run directly against a block device, so
if I am reading the fs/direct-io.c code correctly, the mapping function
(get_block_t) being used is fs/block_dev.c:blkdev_get_block(). Is that
correct? That function is very short and simple, so I can't imagine
how there would be a bug in there. If it were a file on an xfs filesystem,
the get_block_t callback would be xfs_get_blocks or xfs_get_blocks_direct
(which both internally call __xfs_get_blocks). Is my understanding correct?
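
For reference, my recollection of that function is roughly the following
(paraphrased from memory, not copied from fs/block_dev.c; also, the direct
IO path may actually go through the closely related blkdev_get_blocks(),
which adds a length clamp but is otherwise just as trivial):

static int blkdev_get_block(struct inode *inode, sector_t iblock,
			    struct buffer_head *bh, int create)
{
	bh->b_bdev = I_BDEV(inode);	/* map straight onto the device... */
	bh->b_blocknr = iblock;		/* ...at the same block number */
	set_buffer_mapped(bh);
	return 0;
}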


* Re: xfs: invalid requests to request_fn from xfs_repair
  2014-04-05 18:19       ` Jamie Pocas
@ 2014-04-08  3:56         ` Jamie Pocas
  0 siblings, 0 replies; 6+ messages in thread
From: Jamie Pocas @ 2014-04-08  3:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs


Hi Dave,

Okay, I think I've found a workaround but I don't have the root cause yet.
So just to recap, there is a macro defined in include/linux/blkdev.h
called rq_for_each_segment(bvec, rq, iter). The arguments are of type
struct bio_vec*, struct request*, and struct req_iterator respectively. The
macro expands to several other macro invocations, which eventually expand
to approx. 40-50 lines of code.

(NOTE: in newer kernels the first argument is a bio_vec, and not a
bio_vec*. I think this is due to the immutable bio_vec changes, but I could
be wrong. For more on that, see http://lwn.net/Articles/544713/ , but I
digress.)
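
From memory, the 3.2-era definition boils down to something like this
(paraphrased, so the exact text in blkdev.h will differ, and the inner
bio_for_each_segment() expands further still):

struct req_iterator {
	int i;			/* index into the current bio's bi_io_vec[] */
	struct bio *bio;	/* the bio currently being walked */
};

#define __rq_for_each_bio(_bio, rq)				\
	if ((rq->bio))						\
		for (_bio = (rq)->bio; _bio; _bio = _bio->bi_next)

#define rq_for_each_segment(bvl, _rq, _iter)			\
	__rq_for_each_bio(_iter.bio, _rq)			\
		bio_for_each_segment(bvl, _iter.bio, _iter.i)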

It seems like in some instances when using O_DIRECT and iterating through
the request with the above macro, some fields of iter.bio are not what I
expect, and macros like blk_rq_cur_sectors(), which ultimately rely on
rq->bio, therefore do not return the values I expect either. If I assume
(<--- dangerous word) that the sectors are all adjacent, as they really are
supposed to be in everything I've investigated, and use only the bio_vec
fields to track progress, then the problems go away. I have been running all
sorts of data consistency tests since this discovery and I don't have any
further problems so far. Also, xfs_repair is returning cleanly as expected
now.

So basically I am only using blk_rq_pos(rq) once before the loop to get the
start sector, and then relying only on fields of the bvec after that. For
example, if I need to count the number of sectors, I can accumulate
(bvec->bv_len >> 9) across all the iterations. Summing up all of the
blk_rq_cur_sectors(rq) values only worked sometimes, and so was unreliable.
The code to walk the struct request looks something like this now, where it
used to make use of iter.bio->bi_sector and blk_rq_cur_sectors():

    struct req_iterator iter;
    struct bio_vec *bvec;
    void *kern_buffer;
    int ret = 0;               /* set by the (elided) transfer code below */
    sector_t curr_sector = blk_rq_pos(req);
    unsigned curr_num_sectors = 0;
    unsigned total_sectors = 0;

    /* Should not do the memory mapping on a discard request; its payload is
       invalid */
    BUG_ON(unlikely(req->cmd_flags & REQ_DISCARD));

    /* Iterate through the request */
    rq_for_each_segment(bvec, req, iter) {
        kern_buffer = kmap(bvec->bv_page) + bvec->bv_offset;
        curr_sector += curr_num_sectors;
        curr_num_sectors = (bvec->bv_len >> KERNEL_SECTOR_SHIFT);

        /* Do the transfer to the device. This part is not really relevant to
           the discussion here, but you can just imagine it as a memcpy because
           that's effectively what's happening. A failed copy would set ret to
           -EFAULT and eventually the request would be ended with
           blk_end_request_all() with an error code. */

        kunmap(bvec->bv_page);

        /* FAIL */
        if (ret) {
            return ret;
        }

        total_sectors += curr_num_sectors; /* Just for accounting */
    }

I am going to run the compiler with -E and examine the
rq_for_each_segment() macro expansion to see where the suspected problem
happens. If it does look like a bug, I am going to write to LKML. Perhaps
it's something I am supposed to implicitly understand that is not well
documented (like the rest of O_DIRECT), and they can set me straight.

As it stands this doesn't seem to be an xfs issue so I'm going to drop this
discussion unless anyone else is interested in this issue or wants more
info.

Regards,
Jamie




