From: Stan Hoeppner <stan@hardwarefreak.com>
To: Dave Chinner <david@fromorbit.com>,
	"C. Morgan Hamill" <chamill@wesleyan.edu>
Cc: xfs <xfs@oss.sgi.com>
Subject: Re: Question regarding XFS on LVM over hardware RAID.
Date: Tue, 04 Feb 2014 02:00:54 -0600
Message-ID: <52F09E36.8050606@hardwarefreak.com>
In-Reply-To: <20140203214128.GR13997@dastard>

On 2/3/2014 3:41 PM, Dave Chinner wrote:
> On Mon, Feb 03, 2014 at 11:12:39AM -0500, C. Morgan Hamill wrote:
>> Excerpts from Dave Chinner's message of 2014-02-02 16:21:52 -0500:
>>> On Sat, Feb 01, 2014 at 03:06:17PM -0600, Stan Hoeppner wrote:
>>>> On 1/31/2014 3:14 PM, C. Morgan Hamill wrote:
>>>>> So, basically, --dataalignment is my friend during pvcreate and
>>>>> lvcreate.
>>>>
>>>> If the logical sector size reported by your RAID controller is 512
>>>> bytes, then "--dataalignment=9216s" should start your data section on a
>>>> RAID60 stripe boundary after the metadata section.
>>>>
>>>> The PhysicalExtentSize should probably also match the 4608KB stripe
>>>> width, but this is apparently not possible.  PhysicalExtentSize must be
>>>> a power of 2 value.  I don't know if or how this will affect XFS aligned
>>>> write out.  You'll need to consult with someone more knowledgeable of LVM.
>>>
>>> You can't do single IOs of that size, anyway, so this is where the
>>> BBWC on the raid controller does its magic and caches sequential IOs
>>> until it has full stripe writes cached....
>>
>> So I am probably missing something here, could you clarify?  Are you
>> saying that I can't do single IOs of that size (by which I take your
>> meaning to be IOs as small as 9216 sectors) because my RAID controller
>> won't let me (i.e., it will cache anything smaller than the
>> stripe size anyway)?
> 
> Typical limitations on IO size are the size of the hardware DMA
> scatter-gather rings of the HBA/raid controller. For example, the
> two hardware RAID controllers in my largest test box have
> limitations of 70 and 80 segments and maximum IO sizes of 280k and
> 320k.
> 
> And looking at the IO being dispatched with blktrace, I see the
> maximum size is:
> 
>   8,80   2       61     0.769857112 44866  D  WS 12423408 + 560 [qemu-system-x86]
>   8,80   2       71     0.769877563 44866  D  WS 12423968 + 560 [qemu-system-x86]
>   8,80   2       72     0.769889767 44866  D  WS 12424528 + 560 [qemu-system-x86]
>                                                             ^^^
> 
> 560 sectors or 280k. So for this hardware, sequential 280k writes
> are hitting the BBWC. And because they are sequential, the BBWC is
> writing them back as full stripe writes after aggregating them in
> NVRAM. Hence there are no performance-diminishing RMW cycles
> occurring, even though the individual IO size is much smaller than
> the stripe unit/width....
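
[Aside: traces like the one above come from blktrace piped through
blkparse; the device name below is only an example:

  blktrace -d /dev/sdb -o - | blkparse -i -

Each "D  WS  sector + length" line is a write request being dispatched
to the device, and the "+ 560" is its length in 512-byte sectors, i.e.
280k here.]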
> 
>> Or are you saying that XFS with these given
>> settings won't make writes that small (which seems false, since I'm
>> essentially telling it to do writes of precisely that size).  I'm a bit
>> unclear on that.
> 
> What su/sw tells XFS is how to align allocation of files, so that
> when we dispatch sequential IO to that file it is aligned to the
> underlying storage because the extents that the filesystem allocated
> for it are aligned. This means that if you write exactly one stripe
> width of data, it will hit each disk exactly once. It might take 10
> IOs to get the data to the storage, but it will only hit each disk
> once.
> 
> The function of the stripe cache (in software raid) and the BBWC (in
> hardware RAID) is to prevent RMW cycles while the
> filesystem/hardware is still flinging data at the RAID lun. Only
> once the controller has complete stripe widths will it calculate
> parity and write back the data, thereby avoiding a RMW cycle....
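
To make that concrete: on a hypothetical RAID60 LUN with a 256KB chunk
across 18 data spindles (18 x 256KB = 4608KB, the stripe width we've
been discussing), the alignment hints would be handed to XFS as

  mkfs.xfs -d su=256k,sw=18 /dev/vg00/lv_data

The chunk size, spindle count, and VG/LV names above are illustrative
only; plug in whatever geometry your controller actually reports.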

-------
>> In addition, does this in effect mean that when it comes to LVM, extent
>> size makes no difference for alignment purposes?  So I don't have to
>> worry about anything other than aligning the beginning and ending of
>> logical volumes, volume groups, etc. to 9216 sector multiples?
> 
> No, you still have to align everything to the underlying storage so
> that the filesystem on top of the volumes is correctly aligned.
> Where the data will be written (i.e. how the filesystem allocates the
> underlying blocks) determines the IO alignment of sequential/large
> user IOs, and that matters far more than the size of the sequential
> IOs the kernel uses to write the data.
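
One way to sanity-check the end result: once the filesystem is mounted,
xfs_info on the mount point (the path below is just an example) shows
the sunit/swidth it will align allocations to, in filesystem blocks:

  xfs_info /mnt/archive | grep -E 'sunit|swidth'

If those numbers don't correspond to the RAID geometry, something in
the LVM stack has shifted the alignment.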

After a little digging and thinking this through...

The default PE size is 4MB; LVM1 allows it to be raised to 16GB, and
LVM2 apparently has no upper limit at all, so a PE can be a few
thousand times larger than any sane stripe width.  This makes it
pretty clear that PEs exist
strictly for volume management operations, used by the LVM tools, but
have no relationship to regular write IOs.  Thus the PE size need not
match nor be evenly divisible by the stripe width.  It's not part of the
alignment equation.
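
What does matter is where the data area starts.  Something along these
lines (the device name is only an example) puts the PV data area on a
4608KB boundary and lets you verify it afterwards:

  pvcreate --dataalignment 9216s /dev/sdb    # 9216 sectors = 4608KB
  pvs -o +pe_start --units k /dev/sdb        # "1st PE" should show 4608.00k

The PE size itself can be left at the default, per the above.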

-- 
Stan

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
