[PATCH RFC] xfs: set block device logical sector size on xfs


* [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
@ 2013-11-13 18:25 Eric Sandeen
  2013-11-13 18:56 ` Christoph Hellwig
  2013-11-14  0:35 ` Eric Sandeen
  0 siblings, 2 replies; 20+ messages in thread
From: Eric Sandeen @ 2013-11-13 18:25 UTC (permalink / raw)
  To: xfs-oss

Pure RFC; this might be crazy.  Here's the problem I'm trying to solve:

Today, mkfs.xfs will select a 4k sector size for a 4k physical / 512 logical
drive.  (that change was done by me).  The thought was that it'd be an
efficiency gain to not make the drive do the (possible) RMW cycles on
512-byte log IO, primarily.

However, now this restricts all DIO to 4k alignment, not the otherwise-
possible 512.

This came up when qemu-kvm, in cache=none mode, tries to boot off an
image hosted on such a filesystem, and its bios wants to do a 512 byte
direct IO read off the disk - it fails.

But I'm wondering - the buftarg's bt_sshift and bt_smask are only used
in a few places.  

XFS_IOC_DIOINFO - to get the minimum io size
xfs_file_aio_read() and xfs_file_dio_aio_write() to check alignment
_xfs_buf_find() to be sure we aren't doing sub-sector IO.

So what I'm wondering is:  Can we somehow separate the "sector size"
that primarily the xfs log does its IO in (based on sb_sectsize)
from the actual, hard-minimum possible IO size, held in the buftarg's
bt_sshift & bt_smask?
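
For anyone following along, the role of the shift and mask can be sketched
in user space roughly as below.  This is only an illustration, not the
kernel code: struct sector_geom, geom_init() and dio_aligned() are made-up
names, and the test only mirrors the usual "(pos | len) & mask" idiom.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two buftarg fields discussed above. */
struct sector_geom {
	unsigned int	sshift;	/* log2 of the sector size */
	size_t		smask;	/* sector size - 1 */
};

/* Derive shift and mask from a power-of-two sector size. */
static struct sector_geom geom_init(unsigned int sectorsize)
{
	struct sector_geom g = { 0, 0 };

	/* this sketch only handles power-of-two sector sizes */
	assert(sectorsize != 0 && (sectorsize & (sectorsize - 1)) == 0);

	while ((1u << g.sshift) < sectorsize)
		g.sshift++;
	g.smask = (size_t)sectorsize - 1;
	return g;
}

/* A direct IO passes only if both offset and length are sector aligned. */
static bool dio_aligned(const struct sector_geom *g, uint64_t pos, size_t len)
{
	return ((pos | len) & g->smask) == 0;
}
```

With a 4096-byte sector the mask is 0xfff, so a 512-byte read at offset 0
is rejected; with a 512-byte sector the same read is accepted.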

Something like this, though untested, and I'm probably missing something.

Our other option, I guess, is to just revert the mkfs change which
picks the physical rather than logical sector size, and go back to 
512 if it's available as a logical size.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 8968f50..58ce036 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -814,10 +814,12 @@ STATIC int
 xfs_setup_devices(
 	struct xfs_mount	*mp)
 {
+	xfs_buftarg_t		*btp;
 	int			error;

+	btp = mp->m_ddev_targp;
 	error = xfs_setsize_buftarg(mp->m_ddev_targp, mp->m_sb.sb_blocksize,
-				    mp->m_sb.sb_sectsize);
+				    bdev_logical_block_size(btp->bt_bdev));
 	if (error)
 		return error;

@@ -833,9 +835,10 @@ xfs_setup_devices(
 			return error;
 	}
 	if (mp->m_rtdev_targp) {
+		btp = mp->m_rtdev_targp;
 		error = xfs_setsize_buftarg(mp->m_rtdev_targp,
 					    mp->m_sb.sb_blocksize,
-					    mp->m_sb.sb_sectsize);
+					    bdev_logical_block_size(btp->bt_bdev));
 		if (error)
 			return error;
 	}

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-13 18:25 [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg Eric Sandeen
@ 2013-11-13 18:56 ` Christoph Hellwig
  2013-11-13 19:08   ` Eric Sandeen
  2013-11-14  0:35 ` Eric Sandeen
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2013-11-13 18:56 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs-oss

On Wed, Nov 13, 2013 at 12:25:33PM -0600, Eric Sandeen wrote:
> Pure RFC; this might be crazy.  Here's the problem I'm trying to solve:
> 
> Today, mkfs.xfs will select a 4k sector size for a 4k physical / 512 logical
> drive.  (that change was done by me).  The thought was that it'd be an
> efficiency gain to not make the drive do the (possible) RMW cycles on
> 512-byte log IO, primarily.
> 
> However, now this restricts all DIO to 4k alignment, not the otherwise-
> possible 512.
> 
> This came up when qemu-kvm, in cache=none mode, tries to boot off an
> image hosted on such a filesystem, and its bios wants to do a 512 byte
> direct IO read off the disk - it fails.
> 
> But I'm wondering - the buftarg's bt_sshift and bt_smask are only used
> in a few places.  

No need to mess with kernel code IFF we want to change that, just keep
the sector size at 512 bytes and set a log stripe unit at mkfs time.

I have to admit that I'm not really sure if that's what we really want,
though.  A drive that has a larger physical block size will need
read-modify-write cycles internally, which we try to avoid.


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-13 18:56 ` Christoph Hellwig
@ 2013-11-13 19:08   ` Eric Sandeen
  2013-11-13 21:26     ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Sandeen @ 2013-11-13 19:08 UTC (permalink / raw)
  To: Christoph Hellwig, Eric Sandeen; +Cc: xfs-oss

On 11/13/13, 12:56 PM, Christoph Hellwig wrote:
> On Wed, Nov 13, 2013 at 12:25:33PM -0600, Eric Sandeen wrote:
>> Pure RFC; this might be crazy.  Here's the problem I'm trying to solve:
>>
>> Today, mkfs.xfs will select a 4k sector size for a 4k physical / 512 logical
>> drive.  (that change was done by me).  The thought was that it'd be an
>> efficiency gain to not make the drive do the (possible) RMW cycles on
>> 512-byte log IO, primarily.
>>
>> However, now this restricts all DIO to 4k alignment, not the otherwise-
>> possible 512.
>>
>> This came up when qemu-kvm, in cache=none mode, tries to boot off an
>> image hosted on such a filesystem, and its bios wants to do a 512 byte
>> direct IO read off the disk - it fails.
>>
>> But I'm wondering - the buftarg's bt_sshift and bt_smask are only used
>> in a few places.  
> 
> No need to mess with kernel code IFF we want to change that, just keep
> the sector size at 512 bytes and set a log stripe unit at mkfs time.
> 
> I have to admit that I'm not really sure if that's what we really want,
> though.  A drive that has a larger physical block size will need
> read-modify-write cycles internally, which we try to avoid.

Yeah, the problem comes up when it is 100% impossible to boot a
qemu-kvm guest hosted on such a filesystem/drive.  :(

(of course I guess that means it fails on a hard 4k drive too)

I don't know what the guest sees for logical/physical on its
file-backed block device in these cases.

Anyway, if we took your suggestion, normal internal fs operations
(log IO) wouldn't RMW.  But we'd still presumably advertise and allow
smaller DIO sizes, which are inefficient.  We could advertise 4k, but
still allow 512 for less-smart apps, maybe?

-Eric


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-13 19:08   ` Eric Sandeen
@ 2013-11-13 21:26     ` Dave Chinner
  2013-11-13 21:32       ` Eric Sandeen
  2013-11-14 13:37       ` Christoph Hellwig
  0 siblings, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2013-11-13 21:26 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Christoph Hellwig, Eric Sandeen, xfs-oss

On Wed, Nov 13, 2013 at 01:08:30PM -0600, Eric Sandeen wrote:
> On 11/13/13, 12:56 PM, Christoph Hellwig wrote:
> > On Wed, Nov 13, 2013 at 12:25:33PM -0600, Eric Sandeen wrote:
> >> Pure RFC; this might be crazy.  Here's the problem I'm trying to solve:
> >>
> >> Today, mkfs.xfs will select a 4k sector size for a 4k physical / 512 logical
> >> drive.  (that change was done by me).  The thought was that it'd be an
> >> efficiency gain to not make the drive do the (possible) RMW cycles on
> >> 512-byte log IO, primarily.
> >>
> >> However, now this restricts all DIO to 4k alignment, not the otherwise-
> >> possible 512.
> >>
> >> This came up when qemu-kvm, in cache=none mode, tries to boot off an
> >> image hosted on such a filesystem, and its bios wants to do a 512 byte
> >> direct IO read off the disk - it fails.
> >>
> >> But I'm wondering - the buftarg's bt_sshift and bt_smask are only used
> >> in a few places.  
> > 
> > No need to mess with kernel code IFF we want to change that, just keep
> > the sector size at 512 bytes and set a log stripe unit at mkfs time.
> > 
> > I have to admit that I'm not really sure if that's what we really want,
> > though.  A drive that has a larger physical block size will need
> > read-modify-write cycles internally, which we try to avoid.
> 
> Yeah, the problem comes up when it is 100% impossible to boot a
> qemu-kvm guest hosted on such a filesystem/drive.  :(

No it's not. Just use cache=writethrough and the page cache will
take care of the mismatch when it occurs.

> (of course I guess that means it fails on a hard 4k drive too)

And on any other filesystem that thinks it has sectors larger than
512 bytes underlying it (e.g. cdrom has a 2k sector size).

> I don't know what the guest sees for logical/physical on its
> file-backed block device in these cases.

Seems like that's the avenue for improvement here to me. i.e. expose
the correct values to the guest so its mkfs does the right thing.
Or, alternatively, make qemu buffer non-aligned/sized IOs itself
internally.

After all, it has been told to use direct IO, and when that happens
it is the application's responsibility to ensure IO alignment
requirements are met...
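
To make the "buffer it internally" option concrete, here is a minimal
sketch of the arithmetic an application would need to widen an unaligned
request to whole sectors before issuing it as direct IO.  The names
struct aligned_span and round_out() are hypothetical, and this is not
taken from qemu.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a sector-aligned window covering a request. */
struct aligned_span {
	uint64_t start;	/* offset rounded down to a sector boundary */
	uint64_t len;	/* length rounded out to whole sectors */
	uint64_t skip;	/* bytes between start and the caller's offset */
};

/* Widen [off, off + len) to the smallest sector-aligned range covering it. */
static struct aligned_span round_out(uint64_t off, uint64_t len,
				     uint64_t sectorsize)
{
	struct aligned_span s;
	uint64_t mask = sectorsize - 1;
	uint64_t end;

	/* this sketch only handles power-of-two sector sizes */
	assert(sectorsize != 0 && (sectorsize & mask) == 0);

	s.start = off & ~mask;			/* round start down */
	end = (off + len + mask) & ~mask;	/* round end up */
	s.len = end - s.start;
	s.skip = off - s.start;
	return s;
}
```

A 512-byte read at offset 512 on a 4k-sector filesystem becomes a 4096-byte
read at offset 0, with the wanted data 512 bytes into the bounce buffer.
A write would also have to read the window first, i.e. the application
ends up doing the RMW itself.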

> Anyway, if we took your suggestion, normal internal fs operations
> (log IO) wouldn't RMW.  But we'd still presumably advertise and allow
> smaller DIO sizes, which are inefficient.  We could advertise 4k, but
> still allow 512 for less-smart apps, maybe?

I'd say such a problem is a matter of user education and making qemu
aware of logical/physical differences - hacking weird corner cases
into what a sector size means is only going to lead to confusion and
bite us in unexpected ways...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-13 21:26     ` Dave Chinner
@ 2013-11-13 21:32       ` Eric Sandeen
  2013-11-13 22:10         ` Dave Chinner
  2013-11-14 13:37       ` Christoph Hellwig
  1 sibling, 1 reply; 20+ messages in thread
From: Eric Sandeen @ 2013-11-13 21:32 UTC (permalink / raw)
  To: Dave Chinner, Eric Sandeen; +Cc: Christoph Hellwig, xfs-oss

On 11/13/13, 3:26 PM, Dave Chinner wrote:
> On Wed, Nov 13, 2013 at 01:08:30PM -0600, Eric Sandeen wrote:
>> On 11/13/13, 12:56 PM, Christoph Hellwig wrote:
>>> On Wed, Nov 13, 2013 at 12:25:33PM -0600, Eric Sandeen wrote:
>>>> Pure RFC; this might be crazy.  Here's the problem I'm trying to solve:
>>>>
>>>> Today, mkfs.xfs will select a 4k sector size for a 4k physical / 512 logical
>>>> drive.  (that change was done by me).  The thought was that it'd be an
>>>> efficiency gain to not make the drive do the (possible) RMW cycles on
>>>> 512-byte log IO, primarily.
>>>>
>>>> However, now this restricts all DIO to 4k alignment, not the otherwise-
>>>> possible 512.
>>>>
>>>> This came up when qemu-kvm, in cache=none mode, tries to boot off an
>>>> image hosted on such a filesystem, and its bios wants to do a 512 byte
>>>> direct IO read off the disk - it fails.
>>>>
>>>> But I'm wondering - the buftarg's bt_sshift and bt_smask are only used
>>>> in a few places.  
>>>
>>> No need to mess with kernel code IFF we want to change that, just keep
>>> the sector size at 512 bytes and set a log stripe unit at mkfs time.
>>>
>>> I have to admit that I'm not really sure if that's what we really want,
>>> though.  A drive that has a larger physical block size will need
>>> read-modify-write cycles internally, which we try to avoid.
>>
>> Yeah, the problem comes up when it is 100% impossible to boot a
>> qemu-kvm guest hosted on such a filesystem/drive.  :(
> 
> No it's not. Just use cache=writethrough and the page cache will
> take care of the mismatch when it occurs.

Sorry, I meant impossible w/ cache=none.

TBH, I don't know what best practice is.

>> (of course I guess that means it fails on a hard 4k drive too)
> 
> And on any other filesystem that thinks it has sectors larger than
> 512 bytes underlying it (e.g. cdrom has a 2k sector size).
> 
>> I don't know what the guest sees for logical/physical on its
>> file-backed block device in these cases.
> 
> Seems like that's the avenue for improvement here to me. i.e. expose
> the correct values to the guest so its mkfs does the right thing.
> Or, alternatively, make qemu buffer non-aligned/sized IOs itself
> internally.

The guest never _boots_ - it's not a guest mkfs issue.

The guest bios wants to read 512 via DIO off the image on this 4k
sector FS, and fails.

> After all, it has been told to use direct IO, and when that happens
> it is the application's responsibility to ensure IO alignment
> requirements are met...

Agreed, but in talking to a qemu guy... 

"In my understanding, that's a limitation that directly comes from the BIOS interface."
"int 13h just assumes 512 bytes"

But this is above my pay grade.  I don't speak BIOS.

>> Anyway, if we took your suggestion, normal internal fs operations
>> (log IO) wouldn't RMW.  But we'd still presumably advertise and allow
>> smaller DIO sizes, which are inefficient.  We could advertise 4k, but
>> still allow 512 for less-smart apps, maybe?
> 
> I'd say such a problem is a matter of user education and making qemu
> aware of logical/physical differences - hacking weird corner cases
> into what a sector size means is only going to lead to confusion and
> bite us in unexpected ways...

Probably so; hence the "crazy" disclaimer.  ;)

But it does seem a little odd to semi-artificially reject DIOs which
the drive could actually handle.

Indeed, do_blockdev_direct_IO looks right at the logical block size,
and allows it:

        if (offset & blocksize_mask) {
                if (bdev)
                        blkbits = blksize_bits(bdev_logical_block_size(bdev));
                blocksize_mask = (1 << blkbits) - 1;
                if (offset & blocksize_mask)
                        goto out;
        }

it's our checks in XFS that fail.
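
The difference between the two checks can be sketched in user space as
below.  This is an illustration under assumptions: size_to_bits() is a
made-up stand-in for blksize_bits(), and generic_offset_ok() and
fs_sector_offset_ok() are hypothetical names that model only the offset
test, not the full kernel logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* log2 of a power-of-two size; hypothetical stand-in for blksize_bits(). */
static unsigned int size_to_bits(unsigned int size)
{
	unsigned int bits = 0;

	assert(size != 0 && (size & (size - 1)) == 0);
	while ((1u << bits) < size)
		bits++;
	return bits;
}

/*
 * Two-stage test modelled on the snippet above: try the filesystem block
 * size first, then fall back to the device's logical block size.
 */
static bool generic_offset_ok(uint64_t offset, unsigned int fs_blocksize,
			      unsigned int logical_blocksize)
{
	uint64_t mask = ((uint64_t)1 << size_to_bits(fs_blocksize)) - 1;

	if (offset & mask) {
		mask = ((uint64_t)1 << size_to_bits(logical_blocksize)) - 1;
		if (offset & mask)
			return false;
	}
	return true;
}

/* Stricter test against the filesystem's own sector size. */
static bool fs_sector_offset_ok(uint64_t offset, unsigned int fs_sectorsize)
{
	return (offset & ((uint64_t)fs_sectorsize - 1)) == 0;
}
```

For offset 512 on a 4k block / 512 logical device the generic test passes,
while the test against a 4k filesystem sector size does not.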

-Eric

> Cheers,
> 
> Dave.
> 


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-13 21:32       ` Eric Sandeen
@ 2013-11-13 22:10         ` Dave Chinner
  2013-11-13 22:18           ` Eric Sandeen
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2013-11-13 22:10 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Christoph Hellwig, Eric Sandeen, xfs-oss

On Wed, Nov 13, 2013 at 03:32:46PM -0600, Eric Sandeen wrote:
> On 11/13/13, 3:26 PM, Dave Chinner wrote:
> > On Wed, Nov 13, 2013 at 01:08:30PM -0600, Eric Sandeen wrote:
> >> On 11/13/13, 12:56 PM, Christoph Hellwig wrote:
> >>> On Wed, Nov 13, 2013 at 12:25:33PM -0600, Eric Sandeen wrote:
> >>>> Pure RFC; this might be crazy.  Here's the problem I'm trying to solve:
> >>>>
> >>>> Today, mkfs.xfs will select a 4k sector size for a 4k physical / 512 logical
> >>>> drive.  (that change was done by me).  The thought was that it'd be an
> >>>> efficiency gain to not make the drive do the (possible) RMW cycles on
> >>>> 512-byte log IO, primarily.
> >>>>
> >>>> However, now this restricts all DIO to 4k alignment, not the otherwise-
> >>>> possible 512.
> >>>>
> >>>> This came up when qemu-kvm, in cache=none mode, tries to boot off an
> >>>> image hosted on such a filesystem, and its bios wants to do a 512 byte
> >>>> direct IO read off the disk - it fails.
> >>>>
> >>>> But I'm wondering - the buftarg's bt_sshift and bt_smask are only used
> >>>> in a few places.  
> >>>
> >>> No need to mess with kernel code IFF we want to change that, just keep
> >>> the sector size at 512 bytes and set a log stripe unit at mkfs time.
> >>>
> >>> I have to admit that I'm not really sure if that's what we really want,
> >>> though.  A drive that has a larger physical block size will need
> >>> read-modify-write cycles internally, which we try to avoid.
> >>
> >> Yeah, the problem comes up when it is 100% impossible to boot a
> >> qemu-kvm guest hosted on such a filesystem/drive.  :(
> > 
> > No it's not. Just use cache=writethrough and the page cache will
> > take care of the mismatch when it occurs.
> 
> Sorry, I meant impossible w/ cache=none.
> 
> TBH, I don't know what best practice is.

No idea what best practice for the virt side is, but best practice
from a storage perspective is that everything in the IO stack should
use the sector size from the underlying layer. Given that virt adds
layers to the storage stack above the filesystem, that means it
needs to support whatever sector size the filesystem is using...

> >> (of course I guess that means it fails on a hard 4k drive too)
> > 
> > And on any other filesystem that thinks it has sectors larger than
> > 512 bytes underlying it (e.g. cdrom has a 2k sector size).
> > 
> >> I don't know what the guest sees for logical/physical on its
> >> file-backed block device in these cases.
> > 
> > Seems like that's the avenue for improvement here to me. i.e. expose
> > the correct values to the guest so its mkfs does the right thing.
> > Or, alternatively, make qemu buffer non-aligned/sized IOs itself
> > internally.
> 
> The guest never _boots_ - it's not a guest mkfs issue.

Oh, that wasn't clear.

> The guest bios wants to read 512 via DIO off the image on this 4k
> sector FS, and fails.

So the bios has never been updated to handle 4k sector devices?

> > After all, it has been told to use direct IO, and when that happens
> > it is the application's responsibility to ensure IO alignment
> > requirements are met...
> 
> Agreed, but in talking to a qemu guy... 
> 
> "In my understanding, that's a limitation that directly comes from the BIOS interface."
> "int 13h just assumes 512 bytes"
> 
> But this is above my pay grade.  I don't speak BIOS.

Yet all modern bios implementations you find in hardware can boot 4k
sector devices just fine. So, what bios does qemu use?

$ man qemu
.....
QEMU uses the PC BIOS from the Bochs project and the Plex86/Bochs
LGPL VGA BIOS.
.....

So what we have here is an *open source bios* that doesn't handle
drives with 4k sector sizes. There's the problem that needs to be fixed....

> >> Anyway, if we took your suggestion, normal internal fs operations
> >> (log IO) wouldn't RMW.  But we'd still presumably advertise and allow
> >> smaller DIO sizes, which are inefficient.  We could advertise 4k, but
> >> still allow 512 for less-smart apps, maybe?
> > 
> > I'd say such a problem is a matter of user education and making qemu
> > aware of logical/physical differences - hacking weird corner cases
> > into what a sector size means is only going to lead to confusion and
> > bite us in unexpected ways...
> 
> Probably so; hence the "crazy" disclaimer.  ;)
> 
> But it does seem a little odd to semi-artificially reject DIOs which
> the drive could actually handle.
> 
> Indeed, do_blockdev_direct_IO looks right at the logical block size,
> and allows it:
> 
>         if (offset & blocksize_mask) {
>                 if (bdev)
>                         blkbits = blksize_bits(bdev_logical_block_size(bdev));
>                 blocksize_mask = (1 << blkbits) - 1;
>                 if (offset & blocksize_mask)
>                         goto out;
>         }
> 
> it's our checks in XFS that fail.

No they don't - they are working just fine. We've told XFS that the
sector size is X, and therefore we don't allow IO in smaller units,
data or metadata.  That's the whole point of the filesystem having a
configurable sector size - we can *enforce* a larger minimum IO
requirement than the underlying hardware supports.

We've done this for years - e.g. a long time ago MD devices had a
massive performance penalty for sub-page sized IOs, so mkfs set the
sector size to 4k to avoid that problem, even though we could have
done 512 byte IOs to the underlying devices.

Let's fix the problem at the source - the bios that doesn't support
4k sector devices - like we've done for all the other utilities that
need to be aware of disk sector sizes....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-13 22:10         ` Dave Chinner
@ 2013-11-13 22:18           ` Eric Sandeen
  2013-11-14  0:34             ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Sandeen @ 2013-11-13 22:18 UTC (permalink / raw)
  To: Dave Chinner, Eric Sandeen; +Cc: Christoph Hellwig, xfs-oss

On 11/13/13, 4:10 PM, Dave Chinner wrote:

...

> Yet all modern bios implementations you find in hardware can boot 4k
> sector devices just fine.

hm can they really?  Most drives have 512 emulation.

> So, what bios does qemu use?
> 
> $ man qemu
> .....
> QEMU uses the PC BIOS from the Bochs project and the Plex86/Bochs
> LGPL VGA BIOS.
> .....
> 
> So what we have here is an *open source bios* that doesn't handle
> drives with 4k sector sizes. There's the problem that needs to be fixed....

And if it wants to boot a guest OS that doesn't handle 4k sectors?

<snip>

>> it's our checks in XFS that fail.
> 
> No they don't - they are working just fine. We've told XFS that the
> sector size is X, and therefore we don't allow IO in smaller units,
> data or metadata.  That's the whole point of the filesystem having a
> configurable sector size - we can *enforce* a larger minimum IO
> requirement than the underlying hardware supports.

Semantics.  Yes, they work just fine, by failing the call.

> We've done this for years - e.g. a long time ago MD devices had a
> massive performance penalty for sub-page sized IOs, so mkfs set the
> sector size to 4k to avoid that problem, even though we could have
> done 512 byte IOs to the underlying devices.
> 
> Let's fix the problem at the source - the bios that doesn't support
> 4k sector devices - like we've done for all the other utilities that
> need to be aware of disk sector sizes....

I don't disagree with that, but by looking at a 4k/512 drive and deciding
to make 4k sectors, we now present guests with something that barely
exists in the real world: a hard 4k drive w/ no 512 logical fallback.

Hacking up sector sizes in fs/xfs is probably the wrong way to go,
but I'm not sure that essentially forcing hard 4k sectors on every
qemu guest hosted on xfs is a great path either.

Sure, the bios should support 4k - I can ask about that.  But I think
the concern above still stands: in effect we present a device which is
less flexible than the real hardware beneath it; we've removed a
compatibility layer that plenty of software still depends on.

I'm not sure that's the best idea; at best it's unexpected.

-Eric

> Cheers,
> 
> Dave.
> 


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-13 22:18           ` Eric Sandeen
@ 2013-11-14  0:34             ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2013-11-14  0:34 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Christoph Hellwig, Eric Sandeen, xfs-oss

On Wed, Nov 13, 2013 at 04:18:32PM -0600, Eric Sandeen wrote:
> On 11/13/13, 4:10 PM, Dave Chinner wrote:
> 
> ...
> 
> > Yet all modern bios implementations you find in hardware can boot 4k
> > sector devices just fine.
> 
> hm can they really?  Most drives have 512 emulation.
.....
> > We've done this for years - e.g. a long time ago MD devices had a
> > massive performance penalty for sub-page sized IOs, so mkfs set the
> > sector size to 4k to avoid that problem, even though we could have
> > done 512 byte IOs to the underlying devices.
> > 
> > Let's fix the problem at the source - the bios that doesn't support
> > 4k sector devices - like we've done for all the other utilities that
> > need to be aware of disk sector sizes....
> 
> I don't disagree with that, but by looking at a 4k/512 drive and deciding
> to make 4k sectors, we now present guests with something that barely
> exists in the real world: a hard 4k drive w/ no 512 logical fallback.

Yet we've been doing this for *years*. And working around the
limitations of direct IO as we've moved tools to use it. e.g:

commit f63fd26819b82c766f9e31a28daaa16f387baaa3
Author: Dave Chinner <dchinner@redhat.com>
Date:   Mon Oct 10 01:08:31 2011 +0000

    repair: handle repair of image files on large sector size filesystems

    Because repair uses direct IO, it cannot do IO smaller than a sector
    on the underlying device. When repairing a filesystem image, the
    filesystem hosting the image may have a sector size larger than the
    sector size of the image, and so single image sector reads and
    writes will fail.

    To avoid this, when checking a file and there is a sector size
    mismatch like this, turn off direct IO. While there, fix a compile
    bug in the IO_DEBUG option for libxfs which was found during triage.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Alex Elder <aelder@sgi.com>
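
The decision described in that commit message can be sketched as below.
This is not the xfsprogs code: image_can_use_dio() and its parameters are
hypothetical, and it assumes the caller already knows the host's minimum
direct IO size (e.g. from XFS_IOC_DIOINFO) and the image's sector size.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Single image-sector reads and writes are only legal as direct IO if
 * they are a multiple of the smallest IO the hosting filesystem accepts.
 * Otherwise the caller should fall back to buffered IO.
 */
static bool image_can_use_dio(unsigned int host_min_io,
			      unsigned int image_sectorsize)
{
	assert(image_sectorsize != 0);

	if (host_min_io == 0)
		return false;	/* unknown host limit: stay buffered */
	return image_sectorsize % host_min_io == 0;
}
```

A 512-byte-sector image on a 4k-sector host gives false, so the tool would
open the image without O_DIRECT; a 4k-sector image on a 512-byte-sector
host gives true.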

> Hacking up sector sizes in fs/xfs is probably the wrong way to go,
> but I'm not sure that essentially forcing hard 4k sectors on every
> qemu guest hosted on xfs is a great path either.

Just how widespread is the problem?  This is the first actual report
of this problem I've heard of, and the above commit came from seeing
the problem on my own system (I've got a 3.5TB MD RAID6 volume that
mkfs defaulted to 4k sector size when I made it *4 years ago*).

> Sure, the bios should support 4k - I can ask about that.  But I think
> the concern above still stands: in effect we present a device which is
> less flexible than the real hardware beneath it; we've removed a
> compatibility layer that plenty of software still depends on.

Regardless of the XFS side of things, we need to get all the
software that fails with 4k sectors fixed. That's been our modus
operandi since advanced format drives first appeared on the scene
years ago. Why should we suddenly treat qemu differently, especially
when there appears to be a simple workaround (cache=writethrough)?

Yes, I'm being a hard-nosed bastard about this - we can change mkfs
defaults to go back to 512 byte sectors if we choose to, but that
doesn't fix the problem for everyone out there who already has 4k
sector filesystems. And that means qemu needs to be fixed, not
anything on the XFS side....

> I'm not sure that's the best idea; at best it's unexpected.

We never had any guarantee of 512 byte sectors on filesystems. There
never was any "compatibility layer". Direct IO exposes the
filesystem sector size directly to applications, and any application
that does direct IO is expected to handle this, no matter how
"unexpected" it is. Qemu is no exception.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-13 18:25 [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg Eric Sandeen
  2013-11-13 18:56 ` Christoph Hellwig
@ 2013-11-14  0:35 ` Eric Sandeen
  2013-11-14  6:49   ` Dave Chinner
  1 sibling, 1 reply; 20+ messages in thread
From: Eric Sandeen @ 2013-11-14  0:35 UTC (permalink / raw)
  To: Eric Sandeen, xfs-oss, Christoph Hellwig

On 11/13/13, 12:25 PM, Eric Sandeen wrote:
> Pure RFC; this might be crazy.  Here's the problem I'm trying to solve:
> 
> Today, mkfs.xfs will select a 4k sector size for a 4k physical / 512 logical
> drive.  (that change was done by me).  The thought was that it'd be an
> efficiency gain to not make the drive do the (possible) RMW cycles on
> 512-byte log IO, primarily.
> 
> However, now this restricts all DIO to 4k alignment, not the otherwise-
> possible 512.

So, backing up... ;)

XFS isn't doing anything wrong here.  It can make sector sizes as it pleases,
and apps had darned well better accommodate its whims if they do direct IO.

But some apps don't.  And users are sad and confused, and grow to dislike
XFS, because it all worked just fine on that other filesystem, so screw you
XFS, and your flux capacitor drives with your power-fail interrupts!

So my overarching goal here is to have XFS do its internal IO as efficiently
as possible on an "advanced format" drive, i.e. in 4k chunks, but not to
break apps that don't bother to check whether ye olde 512 DIO will work,
if the underlying storage can actually handle it.

We could even ensure that XFS_IOC_DIOINFO offers up "4k" as the answer
to miniosz, so that apps which bother to ask get the optimal answer.

But if we know, deep in our hearts, that a 512 byte DIO is ok, let's
let it pass.

Hacking up bt_sshift and friends might be the wrong way to do it, although
I'm not so sure - that's really all it's used for (today).

Christoph's suggestion to leave sector size at 512 but set a log stripe seems
interesting, too.

Or, we could stop setting 4k sectors for AF drives.

Or we could just carry on, and keep telling users that it's their fault,
their app's fault, etc...

(I'm sympathetic to pushing the envelope and dragging apps into the 21st
century, but it's a double-edged sword).

-Eric


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-14  0:35 ` Eric Sandeen
@ 2013-11-14  6:49   ` Dave Chinner
  2013-11-14 13:09     ` Ric Wheeler
  2013-11-14 15:18     ` Eric Sandeen
  0 siblings, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2013-11-14  6:49 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Christoph Hellwig, Eric Sandeen, xfs-oss

On Wed, Nov 13, 2013 at 06:35:05PM -0600, Eric Sandeen wrote:
> On 11/13/13, 12:25 PM, Eric Sandeen wrote:
> > Pure RFC; this might be crazy.  Here's the problem I'm trying to solve:
> > 
> > Today, mkfs.xfs will select a 4k sector size for a 4k physical / 512 logical
> > drive.  (that change was done by me).  The thought was that it'd be an
> > efficiency gain to not make the drive do the (possible) RMW cycles on
> > 512-byte log IO, primarily.
> > 
> > However, now this restricts all DIO to 4k alignment, not the otherwise-
> > possible 512.
> 
> So, backing up... ;)
> 
> XFS isn't doing anything wrong here.  It can make sector sizes as it pleases,
> and apps had darned well better accommodate its whims if they do direct IO.
> 
> But some apps don't.  And users are sad and confused, and grow to dislike
> XFS, because it all worked just fine on that other filesystem, so screw you
> XFS, and your flux capacitor drives with your power-fail interrupts!

Funny how it's always XFS that is at fault, when the same problem with 4k
sectors will occur on ext4, for example....

> So my overarching goal here is to have XFS do its internal IO as efficiently
> as possible on an "advanced format" drive, i.e. in 4k chunks, but not to
> break apps that don't bother to check whether ye olde 512 DIO will work,
> if the underlying storage can actually handle it.

Yup, it's called buffered IO.

> We could even ensure that XFS_IOC_DIOINFO offers up "4k" as the answer
> to miniosz, so that apps which bother to ask get the optimal answer.

Funnily enough, it does:

		da.d_mem = da.d_miniosz = 1 << target->bt_sshift;

$ sudo xfs_info .
meta-data=/dev/md0               isize=256    agcount=32, agsize=21503744 blks
         =                       sectsz=4096  attr=2, projid32bit=0
         =                       crc=0
data     =                       bsize=4096   blocks=688119680, imaxpct=5
         =                       sunit=32     swidth=320 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=335995, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
$ sudo xfs_io -c stat . |grep dioattr.miniosz
dioattr.miniosz = 4096
$
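
For reference, a minimal userspace sketch of the shift/mask arithmetic
behind the d_miniosz line quoted above.  The struct and function names
are made up for illustration; only the "1 << sshift" relationship comes
from the XFS code shown in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the buftarg's bt_sshift / bt_smask pair. */
struct sector_geom {
	unsigned int	sshift;	/* log2 of the sector size in bytes */
	uint64_t	smask;	/* sector size - 1 */
};

/* Derive shift and mask from a power-of-two sector size in bytes. */
static struct sector_geom sector_geom_init(unsigned int sectorsize)
{
	struct sector_geom g = { 0, 0 };

	/* Sector sizes are powers of two; anything else is a caller bug. */
	assert(sectorsize != 0 && (sectorsize & (sectorsize - 1)) == 0);

	while ((1u << g.sshift) < sectorsize)
		g.sshift++;
	g.smask = (uint64_t)sectorsize - 1;
	return g;
}

/* Same arithmetic as the d_miniosz assignment: 1 << sshift. */
static unsigned int dio_miniosz(const struct sector_geom *g)
{
	return 1u << g->sshift;
}
```

With a 4096-byte sector size this yields a shift of 12 and a minimum
DIO size of 4096, matching the xfs_io output above.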

> But if we know, deep in our hearts, that a 512 byte DIO is ok, let's
> let it pass.

but, well, we don't know it's ok, because we don't know why 4k
sector size was chosen at mkfs time, even though the underlying
device might say it has a 512 byte logical sector size....

> Hacking up bt_sshift and friends might be the wrong way to do it, although
> I'm not so sure - that's really all it's used for (today).
> 
> Christoph's suggestion to leave sector size at 512 but set a log stripe seems
> interesting, too.

Which leaves all AG header writes as single 512 byte sector writes
which will trigger RMW in the hardware. And while those IOs are in
progress, we can't use the AG for allocation or freeing, so
increasing the IO latency of such IO is significant....
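
As a rough illustration of the alignment condition being discussed, the
sketch below flags writes that do not cover whole physical sectors.
The function name is hypothetical, and whether a given drive really
performs a read-modify-write in that case is up to its firmware; this
only models the arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when a write of len bytes at byte offset off does not
 * cover whole physical sectors, i.e. when a 512-logical/4k-physical
 * drive may have to read-modify-write.  phys_size must be a power of two.
 */
static bool write_needs_rmw(uint64_t off, uint64_t len, uint64_t phys_size)
{
	uint64_t mask = phys_size - 1;

	assert(phys_size != 0 && (phys_size & mask) == 0);
	return ((off | len) & mask) != 0;
}
```

A single 512-byte AG header write on a 4k-physical drive is the case
that returns true here; the same write on a 512-physical drive does not.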

> Or, we could stop setting 4k sectors for AF drives.

And just take the RMW penalty?

> Or we could just carry on, and keep telling users that it's their fault,
> their app's fault, etc...

... and getting the problems fixed so they go away forever.

> (I'm sympathetic to pushing the envelope and dragging apps into the 21st
> century, but it's a double-edged sword).

Yes, it is, but if we don't take a stand and say "we, as an
ecosystem, need to support 4k sectors *everywhere*", then we are
going to have such problems *forever*. This isn't purely an XFS
problem - this is something that the entire storage stack needs to
support, from the hardware at the very bottom to the applications at
the very top.

XFS is stuck in the middle, where we cop it from both
the hardware side ("why don't you support our hardware efficiently
yet?") and from the application side when we do ("4k sectors break
our assumptions!"). It's a no win situation for us no matter what we
do, and history has shown that when we don't take a strong
leadership position the problems don't get solved.

So, let's take the initiative and make sure that everyone knows how
to deal with these problems and get them fixed in the right places.
I don't want to be spending the next 10 years complaining about a
lack of 4k sector support in qemu. It's too much like the inode64
saga all over again.

Let's face it, it wouldn't be right if XFS wasn't fighting some
battle to drag Linux kicking and screaming into the present...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-14  6:49   ` Dave Chinner
@ 2013-11-14 13:09     ` Ric Wheeler
  2013-11-14 15:03       ` Eric Sandeen
  2013-11-14 15:18     ` Eric Sandeen
  1 sibling, 1 reply; 20+ messages in thread
From: Ric Wheeler @ 2013-11-14 13:09 UTC (permalink / raw)
  To: xfs


(large snip)

On 11/14/2013 03:49 PM, Dave Chinner wrote:
>> (I'm sympathetic to pushing the envelope and dragging apps into the 21st
>> century, but it's a double-edged sword).
> Yes, it is, but if we don't take a stand and say "we, as an
> ecosystem, need to support 4k sectors *everywhere*", then we are
> going to have such problems *forever*. This isn't purely an XFS
> problem - this is something that the entire storage stack needs to
> support, from the hardware at the very bottom to the applications at
> the very top.
>
> XFS is stuck in the middle, where we cop it from both
> the hardware side ("why don't you support our hardware efficiently
> yet?") and from the application side when we do ("4k sectors break
> our assumptions!"). It's a no win situation for us no matter what we
> do, and history has shown that when we don't take a strong
> leadership position the problems don't get solved.
>
> So, let's take the initiative and make sure that everyone knows how
> to deal with these problems and get them fixed in the right places.
> I don't want to be spending the next 10 years complaining about a
> lack of 4k sector support in qemu. It's too much like the inode64
> saga all over again.
>
> Let's face it, it wouldn't be right if XFS wasn't fighting some
> battle to drag Linux kicking and screaming into the present...
>
> Cheers,
>
> Dave.

I would agree that we should not hit our 4K sector support. Have we reached
out to the KVM/QEMU people to see if we can get them to fix this on their side?

Ric




* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-13 21:26     ` Dave Chinner
  2013-11-13 21:32       ` Eric Sandeen
@ 2013-11-14 13:37       ` Christoph Hellwig
  2013-11-14 14:56         ` Eric Sandeen
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2013-11-14 13:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, Eric Sandeen, Eric Sandeen, xfs-oss

On Thu, Nov 14, 2013 at 08:26:58AM +1100, Dave Chinner wrote:
> Seems like that's the avenue for improvement here to me. i.e. expose
> the correct values to the guest so its mkfs does the right thing.
> Or, alternatively, make qemu buffer non-aligned/sized IOs itself
> internally.

I implemented the support to expose these to the guest in qemu
years ago.  But the problem remains that this is information which
needs to be attached to the image, which can't really work with raw
images, and no one has bothered to implement the support to store it
for, say, qcow2.


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-14 13:37       ` Christoph Hellwig
@ 2013-11-14 14:56         ` Eric Sandeen
  2013-11-14 21:01           ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Sandeen @ 2013-11-14 14:56 UTC (permalink / raw)
  To: Christoph Hellwig, Dave Chinner; +Cc: Eric Sandeen, xfs-oss

On 11/14/13, 7:37 AM, Christoph Hellwig wrote:
> On Thu, Nov 14, 2013 at 08:26:58AM +1100, Dave Chinner wrote:
>> Seems like that's the avenue for improvement here to me. i.e. expose
>> the correct values to the guest so its mkfs does the right thing.
>> Or, alternatively, make qemu buffer non-aligned/sized IOs itself
>> internally.
> 
> I implemented the support to expose these to the guest in qemu
> years ago.  But the problem remains that this is information which
> needs to be attached to the image, which can't really work with raw
> images, and no one has bothered to implement the support to store it
> for, say, qcow2.
> 

Ok but once again - this is not a guest mkfs issue.  The reported
problem is that the guest cannot _boot_ in cache=none mode because
the bios attempts a 512-byte DIO.
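
To make the application-side fix concrete, here is a minimal sketch of
the "buffer non-aligned/sized IOs internally" idea quoted above: widen
the 512-byte request to the filesystem's minimum DIO size, read that
into an aligned bounce buffer, and copy out the bytes that were asked
for.  Only the range arithmetic is shown; the names are hypothetical
and this is not qemu code.

```c
#include <assert.h>
#include <stdint.h>

struct io_range {
	uint64_t off;	/* byte offset of the widened request */
	uint64_t len;	/* byte length of the widened request */
};

/* Expand [off, off + len) outward to multiples of align (a power of two). */
static struct io_range widen_to_alignment(uint64_t off, uint64_t len,
					  uint64_t align)
{
	uint64_t mask = align - 1;
	struct io_range r;

	assert(align != 0 && (align & mask) == 0);
	r.off = off & ~mask;				/* round start down */
	r.len = ((off + len + mask) & ~mask) - r.off;	/* round end up */
	return r;
}
```

The requested data then sits at byte (off - r.off) inside the bounce
buffer.  The alignment value would come from something like the
miniosz that XFS_IOC_DIOINFO reports.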

Yes, this is all qemu's fault.   We can fight that in the court
of public opinion, I guess.

-Eric


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-14 13:09     ` Ric Wheeler
@ 2013-11-14 15:03       ` Eric Sandeen
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Sandeen @ 2013-11-14 15:03 UTC (permalink / raw)
  To: Ric Wheeler, xfs

On 11/14/13, 7:09 AM, Ric Wheeler wrote:
> I would agree that we should not hit our 4K sector support. Have
> we reached out to the KVM/QEMU people to see if we can get them to
> fix this on their side?

Yeah, talking to them now (well, not on a public list yet).

-Eric


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-14  6:49   ` Dave Chinner
  2013-11-14 13:09     ` Ric Wheeler
@ 2013-11-14 15:18     ` Eric Sandeen
  1 sibling, 0 replies; 20+ messages in thread
From: Eric Sandeen @ 2013-11-14 15:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, Eric Sandeen, xfs-oss

On 11/14/13, 12:49 AM, Dave Chinner wrote:
> On Wed, Nov 13, 2013 at 06:35:05PM -0600, Eric Sandeen wrote:
>> On 11/13/13, 12:25 PM, Eric Sandeen wrote:
>>> Pure RFC; this might be crazy.  Here's the problem I'm trying to solve:
>>>
>>> Today, mkfs.xfs will select a 4k sector size for a 4k physical / 512 logical
>>> drive.  (that change was done by me).  The thought was that it'd be an
>>> efficiency gain to not make the drive do the (possible) RMW cycles on
>>> 512-byte log IO, primarily.
>>>
>>> However, now this restricts all DIO to 4k alignment, not the otherwise-
>>> possible 512.
>>
>> So, backing up... ;)
>>
>> XFS isn't doing anything wrong here.  It can make sector sizes as it pleases,
>> and apps had darned well better accommodate its whims if they do direct IO.
>>
>> But some apps don't.  And users are sad and confused, and grow to dislike
>> XFS, because it all worked just fine on that other filesystem, so screw you
>> XFS, and your flux capacitor drives with your power-fail interrupts!
> 
> Funny how it's always XFS is at fault, when the same problem with 4k
> sectors will occur on ext4, for example....

Yep, on a non-existent hard 4k disk, ext4 would have the same problem.

Meanwhile in the world of actual hardware, ext4 is fine.  (there's no
similar sector-size switch for ext4).

Again; I'm NOT saying xfs is doing anything wrong, or is at fault.

We can be right all the way to the grave, if apps never get fixed,
and users have a choice of fs.

...

>> We could even ensure that XFS_IOC_DIOINFO offers up "4k" as the answer
>> to miniosz, so that apps which bother to ask get the optimal answer.
> 
> Funnily enough, it does:
> 
> 		da.d_mem = da.d_miniosz = 1 << target->bt_sshift;

...

Of course it does today; I was talking about whether we could report this
but still allow 512 under the covers.
 
>> Or, we could stop setting 4k sectors for AF drives.
> 
> And just take the RMW penalty?

that, and the bonus of existing applications continuing to function.
 
>> Or we could just carry on, and keep telling users that it's their fault,
>> their app's fault, etc...
> 
> ... and getting the problems fixed so they go away forever.

... or not.  *cough*64 bit inodes*cough*

>> (I'm sympathetic to pushing the envelope and dragging apps into the 21st
>> century, but it's a double-edged sword).
> 
> Yes, it is, but if we don't take a stand and say "we, as an
> ecosystem, need to support 4k sectors *everywhere*", then we are
> going to have such problems *forever*. This isn't purely an XFS
> problem - this is something that the entire storage stack needs to
> support, from the hardware at the very bottom to the applications at
> the very top.
> 
> XFS is stuck in the middle, where we cop it from both
> the hardware side ("why don't you support our hardware efficiently
> yet?") and from the application side when we do ("4k sectors break
> our assumptions!"). It's a no win situation for us no matter what we
> do, and history has shown that when we don't take a strong
> leadership position the problems don't get solved.
> 
> So, let's take the initiative and make sure that everyone knows how
> to deal with these problems and get them fixed in the right places.
> I don't want to be spending the next 10 years complaining about a
> lack of 4k sector support in qemu. It's too much like the inode64
> saga all over again.

which, TBH, has still never been fully addressed.

> Let's face it, it wouldn't be right if XFS wasn't fighting some
> battle to drag Linux kicking and screaming into the present...

Well.  With my distro hat on I might have to be pragmatic, and keep
things working that are required to work.

Upstream, sure, we can keep beating users with a stick until they
force their app writers to make things work for them again.  ;)

(Again, though, as middle ground - if there were a way for XFS to do
all internal IO efficiently as 4k-aligned, but allow applications
to do 512 emulation, that would be, IMHO, a great thing.  I'm not
yet sure what it would take.)

-Eric


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-14 14:56         ` Eric Sandeen
@ 2013-11-14 21:01           ` Dave Chinner
  2013-11-22 14:13             ` Ric Wheeler
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2013-11-14 21:01 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Christoph Hellwig, Eric Sandeen, xfs-oss

On Thu, Nov 14, 2013 at 08:56:04AM -0600, Eric Sandeen wrote:
> On 11/14/13, 7:37 AM, Christoph Hellwig wrote:
> > On Thu, Nov 14, 2013 at 08:26:58AM +1100, Dave Chinner wrote:
> >> Seems like that's the avenue for improvement here to me. i.e. expose
> >> the correct values to the guest so its mkfs does the right thing.
> >> Or, alternatively, make qemu buffer non-aligned/sized IOs itself
> >> internally.
> > 
> > I implemented the support to expose these to the guest in qemu
> > years ago.  But the problem remains that this is information which
> > needs to be attached to the image, which can't really work with raw
> > images, and no one has bothered to implement the support to store it
> > for, say, qcow2.
> > 
> 
> Ok but once again - this is not a guest mkfs issue.  The reported
> problem is that the guest cannot _boot_ in cache=none mode because
> the bios attempts a 512-byte DIO.

A different viewpoint: How can we make sure real 4k sector hardware
works with Linux when it comes along if we can't emulate it via
qemu + virtualisation?

People often use qemu + virtualisation as a method of testing code
for hardware they don't have access to, and this just seems like
another of those things that we should have working in this
environment long before real hardware comes along and requires it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-14 21:01           ` Dave Chinner
@ 2013-11-22 14:13             ` Ric Wheeler
  2013-11-22 14:20               ` Christoph Hellwig
  2013-11-22 14:57               ` Eric Sandeen
  0 siblings, 2 replies; 20+ messages in thread
From: Ric Wheeler @ 2013-11-22 14:13 UTC (permalink / raw)
  To: Dave Chinner, Eric Sandeen; +Cc: Christoph Hellwig, Eric Sandeen, xfs-oss

On 11/14/2013 04:01 PM, Dave Chinner wrote:
> On Thu, Nov 14, 2013 at 08:56:04AM -0600, Eric Sandeen wrote:
>> On 11/14/13, 7:37 AM, Christoph Hellwig wrote:
>>> On Thu, Nov 14, 2013 at 08:26:58AM +1100, Dave Chinner wrote:
>>>> Seems like that's the avenue for improvement here to me. i.e. expose
>>>> the correct values to the guest so its mkfs does the right thing.
>>>> Or, alternatively, make qemu buffer non-aligned/sized IOs itself
>>>> internally.
>>> I implemented the support to expose these to the guest in qemu
>>> years ago.  But the problem remains that this is information which
>>> needs to be attached to the image, which can't really work with raw
>>> images, and no one has bothered to implement the support to store it
>>> for, say, qcow2.
>>>
>> Ok but once again - this is not a guest mkfs issue.  The reported
>> problem is that the guest cannot _boot_ in cache=none mode because
>> the bios attempts a 512-byte DIO.
> A different viewpoint: How can we make sure real 4k sector hardware
> works with Linux when it comes along if we can't emulate it via
> qemu + virtualisation?
>
> People often use qemu + virtualisation as a method of testing code
> for hardware they don't have access to, and this just seems like
> another of those things that we should have working in this
> environment long before real hardware comes along and requires it...
>
> Cheers,
>
> Dave.

I think you do that by using SCSI debug to get a 4K sector drive - that is how 
we tested for RHEL6 for example.  Layering on restrictions to hardware in the 
file system seems a bit harsh.

The QEMU crowd will be working to get better support for 4K drives in the 
future, but I think that we are effectively going to cause a huge field issue 
here since these 512/4K drives are extremely common.

Given the SCSI debug method for this, does that mean you retract your objections 
and will support Eric's patch :) ?

Regards,

Ric


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-22 14:13             ` Ric Wheeler
@ 2013-11-22 14:20               ` Christoph Hellwig
  2013-11-22 14:26                 ` Ric Wheeler
  2013-11-22 14:57               ` Eric Sandeen
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2013-11-22 14:20 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: Christoph Hellwig, Eric Sandeen, xfs-oss, Eric Sandeen

On Fri, Nov 22, 2013 at 09:13:56AM -0500, Ric Wheeler wrote:
> I think you do that by using SCSI debug to get a 4K sector drive -
> that is how we tested for RHEL6 for example.  Layering on
> restrictions to hardware in the file system seems a bit harsh.
> 
> The QEMU crowd will be working to get better support for 4K drives
> in the future, but I think that we are effectively going to cause a
> huge field issue here since these 512/4K drives are extremely
> common.
> 
> Given the SCSI debug method for this, does that mean you retract
> your objections and will support Eric's patch :) ?

We actually have a test for 4k drives using scsi_debug in xfstests: xfs/279.


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-22 14:20               ` Christoph Hellwig
@ 2013-11-22 14:26                 ` Ric Wheeler
  0 siblings, 0 replies; 20+ messages in thread
From: Ric Wheeler @ 2013-11-22 14:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Eric Sandeen, xfs-oss, Eric Sandeen

On 11/22/2013 09:20 AM, Christoph Hellwig wrote:
> On Fri, Nov 22, 2013 at 09:13:56AM -0500, Ric Wheeler wrote:
>> I think you do that by using SCSI debug to get a 4K sector drive -
>> that is how we tested for RHEL6 for example.  Layering on
>> restrictions to hardware in the file system seems a bit harsh.
>>
>> The QEMU crowd will be working to get better support for 4K drives
>> in the future, but I think that we are effectively going to cause a
>> huge field issue here since these 512/4K drives are extremely
>> common.
>>
>> Given the SCSI debug method for this, does that mean you retract
>> your objections and will support Eric's patch :) ?
> We actually have a test for 4k drives using scsi_debug in xfstests: xfs/279.
>

Just to add on here: we are going to work more closely with the kvm/qemu people
to make sure that they properly take care of the hints about alignment and so on 
(thanks to you for putting that in!).

Ric


* Re: [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg
  2013-11-22 14:13             ` Ric Wheeler
  2013-11-22 14:20               ` Christoph Hellwig
@ 2013-11-22 14:57               ` Eric Sandeen
  1 sibling, 0 replies; 20+ messages in thread
From: Eric Sandeen @ 2013-11-22 14:57 UTC (permalink / raw)
  To: Ric Wheeler, Dave Chinner; +Cc: Christoph Hellwig, Eric Sandeen, xfs-oss

On 11/22/13, 8:13 AM, Ric Wheeler wrote:

<snip>

> I think you do that by using SCSI debug to get a 4K sector drive -
> that is how we tested for RHEL6 for example. Layering on restrictions
> to hardware in the file system seems a bit harsh.
> 
> The QEMU crowd will be working to get better support for 4K drives in
> the future, but I think that we are effectively going to cause a huge
> field issue here since these 512/4K drives are extremely common.
> 
> Given the SCSI debug method for this, does that mean you retract your
> objections and will support Eric's patch :) ?

FWIW, my patch is a disaster, but I'll work on something along those
lines that's not a disaster, so we can discuss it properly.  ;)

To make this go, I think we need to add a structure member to the
xfs_buftarg which describes the logical sector size, and use that
to enforce minimum IO sizes.  The current sector size fields can
remain in place for the mkfs-specified, presumably physical sector
size.

Then, since the sector sizes in the sb, mp, and buftarg have been
disassociated a bit, I'll need to audit things like the sub-block
zeroing paths so that we DTRT on a sub-block DIO.

At that point, the "sector size" semantics in the mkfs.xfs manpage
get a little weird; if we specify a sector size of 4k, how can
we do sub-sector IOs?

What the mkfs option really means at that point is that the specified
size is the minimum size and alignment which will be generated from
within the filesystem for metadata; we can make it clear that the
underlying logical sector size is still the constraint for userspace
DIO.
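
Roughly, and only as a userspace sketch of the idea rather than the
eventual patch, the split could look like the following.  The struct
and its field names are hypothetical stand-ins for the real
xfs_buftarg; the point is that DIO is checked against the logical
size while metadata keeps the mkfs-specified size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down buftarg: field names are illustrative only. */
struct buftarg_sketch {
	unsigned int	meta_sectorsize;	/* mkfs-specified, presumably physical */
	unsigned int	logical_sectorsize;	/* what the block device reports */
};

/* True when pos and count are both multiples of the power-of-two size. */
static bool aligned_to(uint64_t pos, uint64_t count, unsigned int size)
{
	uint64_t mask = (uint64_t)size - 1;

	assert(size != 0 && (size & (size - 1)) == 0);
	return ((pos | count) & mask) == 0;
}

/* Userspace DIO: constrained only by the device's logical sector size. */
static bool dio_allowed(const struct buftarg_sketch *bt,
			uint64_t pos, uint64_t count)
{
	return aligned_to(pos, count, bt->logical_sectorsize);
}

/* Filesystem-generated metadata IO: keeps the mkfs sector size. */
static bool metadata_io_allowed(const struct buftarg_sketch *bt,
				uint64_t pos, uint64_t count)
{
	return aligned_to(pos, count, bt->meta_sectorsize);
}
```

On a 4k-physical / 512-logical device, a 512-byte DIO at offset 512
passes the first check and fails the second, which is the behaviour
described above.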

I'm not quite sure what the XFS_IOC_DIOINFO ioctl should advertise,
at that point.

Anyway, that's about where I'm at in my brain with all this, will
try to get something that actually works relatively soon.

-Eric

> Regards,
> 
> Ric
> 


end of thread, other threads:[~2013-11-22 14:57 UTC | newest]

Thread overview: 20+ messages
2013-11-13 18:25 [PATCH RFC] xfs: set block device logical sector size on xfs_buftarg Eric Sandeen
2013-11-13 18:56 ` Christoph Hellwig
2013-11-13 19:08   ` Eric Sandeen
2013-11-13 21:26     ` Dave Chinner
2013-11-13 21:32       ` Eric Sandeen
2013-11-13 22:10         ` Dave Chinner
2013-11-13 22:18           ` Eric Sandeen
2013-11-14  0:34             ` Dave Chinner
2013-11-14 13:37       ` Christoph Hellwig
2013-11-14 14:56         ` Eric Sandeen
2013-11-14 21:01           ` Dave Chinner
2013-11-22 14:13             ` Ric Wheeler
2013-11-22 14:20               ` Christoph Hellwig
2013-11-22 14:26                 ` Ric Wheeler
2013-11-22 14:57               ` Eric Sandeen
2013-11-14  0:35 ` Eric Sandeen
2013-11-14  6:49   ` Dave Chinner
2013-11-14 13:09     ` Ric Wheeler
2013-11-14 15:03       ` Eric Sandeen
2013-11-14 15:18     ` Eric Sandeen
