* RAID6 + XFS + Application Workload write performance question
@ 2017-10-30 16:43 Kyle Ames
  2017-10-30 21:39 ` Dave Chinner
  2017-10-31 11:06 ` Emmanuel Florac
  0 siblings, 2 replies; 3+ messages in thread
From: Kyle Ames @ 2017-10-30 16:43 UTC (permalink / raw)
  To: linux-xfs

Hello!

I’m trying to track down odd write performance from a test of our application’s I/O workload. Admittedly I am not extremely experienced with this domain (file systems, storage, tuning, etc.). I’ve done a ton of research and I think I’ve gotten as far as I possibly can without reaching out for help from domain experts.

Here is the setup:

OS: CentOS 7.3
Kernel: 3.10.0-693.2.2.el7.x86_64
RAID: LSI RAID controller - RAID 6 with 10 disks - Strip Size 128 (and thus a stripe size of 1MB if I understand correctly)
LVM: One PV, VG, and LV are built on top of the RAID6. The output is below.

RAID output:
-----------------------------------------------------------------
DG/VD TYPE  State Access Consist Cache Cac sCC      Size Name
-----------------------------------------------------------------
0/0   RAID6 Optl  RW     No      RAWBD -   ON  72.761 TB DATA

----------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp Type
----------------------------------------------------------------------------
18:0     35 Onln   0 9.094 TB SAS  HDD N   N  512B ST10000NM0096    U  -
18:1     39 Onln   0 9.094 TB SAS  HDD N   N  512B ST10000NM0096    U  -
18:2     38 Onln   0 9.094 TB SAS  HDD N   N  512B ST10000NM0096    U  -
18:3     41 Onln   0 9.094 TB SAS  HDD N   N  512B ST10000NM0096    U  -
18:4     36 Onln   0 9.094 TB SAS  HDD N   N  512B ST10000NM0096    U  -
18:5     37 Onln   0 9.094 TB SAS  HDD N   N  512B ST10000NM0096    U  -
18:6     42 Onln   0 9.094 TB SAS  HDD N   N  512B ST10000NM0096    U  -
18:7     45 Onln   0 9.094 TB SAS  HDD N   N  512B ST10000NM0096    U  -
18:8     46 Onln   0 9.094 TB SAS  HDD N   N  512B ST10000NM0096    U  -
18:9     44 Onln   0 9.094 TB SAS  HDD N   N  512B ST10000NM0096    U  -
18:10    43 Onln   1 9.094 TB SAS  HDD N   N  512B ST10000NM0096    U  -
18:11    40 Onln   1 9.094 TB SAS  HDD N   N  512B ST10000NM0096    U  -


Layered on top of this we have LVM with everything aligned to the stripe size.

  --- Physical volume ---
  PV Name               /dev/sda
  VG Name               vgdata
  PV Size               72.76 TiB / not usable 4.00 MiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              19074047
  Free PE               0
  Allocated PE          19074047
  PV UUID               GsHPeD-5uRM-SOUz-8eEO-kznf-zTaT-oEIx58

  --- Volume group ---
  VG Name               vgdata
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  4
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               72.76 TiB
  PE Size               4.00 MiB
  Total PE              19074047
  Alloc PE / Size       19074047 / 72.76 TiB
  Free  PE / Size       0 / 0
  VG UUID               PKFh9X-3gTb-0GZO-vLAc-vdPI-lV6W-aoNDDU

  --- Logical volume ---
  LV Path                /dev/vgdata/lvdata
  LV Name                lvdata
  VG Name                vgdata
  LV UUID                esOGWf-jV89-7euY-WV3h-MZ2p-uqmt-qXkC5F
  LV Write Access        read/write
  LV Creation host, time XXXXXXX, 2017-10-25 14:34:06 +0000
  LV Status              available
  # open                 1
  LV Size                72.76 TiB
  Current LE             19074047
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0

Here is the output when the XFS filesystem is created:

mkfs.xfs -d su=128k,sw=8 -L DATA -f /dev/mapper/vgdata-lvdata
mkfs.xfs: Specified data stripe width 2048 is not the same as the volume stripe width 512        (KEA: I’m not sure if this is an actual problem or not from googling around)
meta-data=/dev/mapper/vgdata-lvdata isize=512    agcount=73, agsize=268435424 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=19531824128, imaxpct=1
         =                       sunit=32     swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


And finally the output of mount: (rw,relatime,attr2,inode64,sunit=256,swidth=2048,noquota)
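
If I'm reading the units right, mkfs reports sunit/swidth in 4KiB filesystem blocks (sunit=32 -> 128KiB, swidth=256 -> 1MiB) while the mount options report them in 512-byte sectors (sunit=256 -> 128KiB, swidth=2048 -> 1MiB). For completeness, this is how I can check what stripe geometry the block layer itself advertises for the LV (standard util-linux/sysfs queries; dm-0 corresponds to the 253:0 block device shown in the lvdisplay output above):

  lsblk -t /dev/mapper/vgdata-lvdata            # MIN-IO / OPT-IO columns
  cat /sys/block/dm-0/queue/minimum_io_size     # stripe unit, in bytes
  cat /sys/block/dm-0/queue/optimal_io_size     # stripe width, in bytes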

Our application writes to a nested directory hierarchy as follows:

<THREAD>/<DATE>/<HOUR>/<MINUTE>/<DATE>-<HOUR><MINUTE><SECOND>.data

The writes to a file are all approximately 1MiB in size. Depending on our data rate, there are potentially 100-200 writes to a given file. A new file is created each second. There are 4 such threads scheduling these writes, and the writes to each file are scheduled sequentially. The application can use either AIO or blocking IO; both see the same problem.

In order to determine whether our application itself is to blame, we are running a performance test that simulates the writes with “dd if=/dev/zero of=<path/like/above>.data bs=1024K”. We let each dd process write as much as it possibly can for a second before stopping it and moving on to the next file, roughly as sketched below.
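
Each simulated writer is essentially the following loop (a minimal sketch - TESTDIR and THREAD are placeholder names, and the real harness runs 4 of these in parallel):

  TESTDIR=/data/test      # placeholder mount point
  THREAD=0                # placeholder writer id
  while true; do
      d=$(date -u +%Y%m%d); h=$(date -u +%H); m=$(date -u +%M); s=$(date -u +%S)
      dir="$TESTDIR/$THREAD/$d/$h/$m"
      mkdir -p "$dir"
      # write 1MiB blocks flat out for ~1 second, then stop and move to the next file
      timeout -s INT 1 dd if=/dev/zero of="$dir/$d-$h$m$s.data" bs=1024k 2>/dev/null
  done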

What we’re seeing is that write performance starts off around 1400-1500 MB/s, decreasing approximately linearly all the way down to around 600 MB/s after ~18 minutes before suddenly shooting back up to 1400-1500 MB/s. This cycle continues, with the crests and troughs slowly decreasing as the disk fills up (which I believe is expected).

We tried running it with 2 threads. We saw the same degradation and recovery performance profile, except it took ~36 minutes to bottom out and recover. Likewise, with only 1 thread it took ~72 minutes. In all cases the pattern continued until the disk was full.

We thought perhaps the directory structure was problematic, so we tried the following directory structure too: <THREAD>/<DATE>/<DATE>-<HOUR><MINUTE><SECOND>.data. This also had one file per second. This time, it took about 12.2 hours for the performance to bottom out before instantly shooting back up again.

Some other notes:

- I ran the same test with an Adaptec RAID controller as well, which gave the same performance profile.
- I ran the same test with an ext4 filesystem just to see if it gave the same performance profile. It did not - the performance slowly degraded over time before a quick dropoff as the disk reached max capacity. I expected a different profile, but just wanted to run something to make sure that would be the case.

I’ve been trying to read up on XFS internals to correlate this performance profile with a cause, but since I’m so new to this it’s tough to filter out the noise and key in on something meaningful. If there is any information that I haven’t provided, please let me know and I’ll happily provide it.

Thanks!!!

-Kyle Ames







* Re: RAID6 + XFS + Application Workload write performance question
  2017-10-30 16:43 RAID6 + XFS + Application Workload write performance question Kyle Ames
@ 2017-10-30 21:39 ` Dave Chinner
  2017-10-31 11:06 ` Emmanuel Florac
  1 sibling, 0 replies; 3+ messages in thread
From: Dave Chinner @ 2017-10-30 21:39 UTC (permalink / raw)
  To: Kyle Ames; +Cc: linux-xfs

On Mon, Oct 30, 2017 at 04:43:15PM +0000, Kyle Ames wrote:
> Hello!
> 
> I’m trying to track down odd write performance from a test of
> our application’s I/O workload. Admittedly I am not extremely
> experienced with this domain (file systems, storage, tuning,
> etc.). I’ve done a ton of research and I think I’ve gotten
> as far as I possibly can without reaching out for help from domain
> experts.
>
> Here is the setup:
> 
> OS: CentOS 7.3
> Kernel: 3.10.0-693.2.2.el7.x86_64
> RAID: LSI RAID controller - RAID 6 with 10 disks - Strip Size 128 (and thus a stripe size of 1MB if I understand correctly)

[snip]

> mkfs.xfs -d su=128k,sw=8 -L DATA -f /dev/mapper/vgdata-lvdata
> mkfs.xfs: Specified data stripe width 2048 is not the same as the volume stripe width 512        (KEA: I’m not sure if this is an actual problem or not from googling around)

That's telling you there's a problem with your stripe alignment
setup somewhere. LVM is telling XFS that the total stripe width
is 256k, not 1MB. So it's likely your LVM setup isn't aligned/sized
properly to the RAID6 volume.
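
Worth checking what the device stack is actually advertising,
e.g. (device names taken from your report; treat this as a sketch):

  # io sizes the RAID LUN and the LV expose to the filesystem
  lsblk -t /dev/sda /dev/mapper/vgdata-lvdata
  # if the PV data area turns out not to be stripe aligned, one way
  # to rebuild it aligned to the full 1MiB stripe (destroys the
  # existing LV and its data - scratch setup only):
  vgremove vgdata
  pvremove /dev/sda
  pvcreate --dataalignment 1m /dev/sda
  vgcreate vgdata /dev/sda
  lvcreate -l 100%FREE -n lvdata vgdata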

> meta-data=/dev/mapper/vgdata-lvdata isize=512    agcount=73, agsize=268435424 blks

73 ags.

<....>

> Our application writes to a nested directory hierarchy as follows:
> 
> <THREAD>/<DATE>/<HOUR>/<MINUTE>/<DATE>-<HOUR><MINUTE><SECOND>.data

<snip>

> What we’re seeing is that write performance starts off around 1400-1500 MB/s, decreasing approximately linearly all the way down to around 600 MB/s after ~18 minutes before suddenly shooting back up to 1400-1500 MB/s. This cycle continues, with the crests and troughs slowly decreasing as the disk fills up (which I believe is expected).
> 
> We tried running it with 2 threads. We saw the same degradation and recovery performance profile, except it took ~36 minutes to bottom out and recover. Likewise, with only 1 thread it took ~72 minutes. In all cases the pattern continued until the disk was full.

A 72-minute cycle. Coincidence that it matches the AG count? Not at
all.

Once a minute, the workload changes directory. The directory for
the next minute gets put in the next AG. Over 73 minutes, we
have a set of files spread across all 73 AGs. AG 0 is at the outer
edge of all the disks in the LUN; AG 72 is at the inner edge of
all the disks in the LUN.

The transfer speed manufacturers typically quote for spinning rust
is measured at the outer edge (usually >200MB/s these days), so if
we take into account the latencies involved in writing to all disks
at once, ~190MB/s per disk across 8 data disks gives ~1500MB/s.

However, at the inner edge of the disks, transfer rates to the media
are usually in the range of 50-100MB/s. 8x75MB/s = 600MB/s.

The cycle time was halved for two threads because there are 2
directories per minute, so it cycles through the 73 AGs at twice the
rate.

Essentially, XFS is demonstrating the exact performance of your
underlying array.
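
You can see the per-directory AG placement directly with xfs_bmap -
the AG column shows which allocation group each extent of a file
landed in (the path below is just your placeholder layout, not a
real file name):

  xfs_bmap -v <mountpoint>/<THREAD>/<DATE>/<HOUR>/<MINUTE>/<DATE>-<HOUR><MINUTE><SECOND>.data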

> We thought perhaps the directory structure was problematic, so we tried the following directory structure too: <THREAD>/<DATE>/<DATE>-<HOUR><MINUTE><SECOND>.data. This also had one file per second. This time, it took about 12.2 hours for the performance to bottom out before instantly shooting back up again.

Yup - this time you'll probably find it slowly walked the AGs until
it ran out of stripe-unit-aligned free space, then it went back to
AG 0, where the parent directory is, and started filling holes.
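
If you want to confirm that, watch the free space layout as the
test runs, e.g. with xfs_db in read-only mode (numbers on a mounted
fs are only approximate):

  # free space histogram for the whole filesystem
  xfs_db -r -c "freesp -s" /dev/mapper/vgdata-lvdata
  # free space histogram for a single AG (AG 0 here)
  xfs_db -r -c "freesp -s -a 0" /dev/mapper/vgdata-lvdata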

So, two things - there's probably an issue with your stripe
alignment, and the behaviour you are seeing is a direct result of
the way XFS physically isolates per-directory data and the
underlying device's performance across its address space.

> Some other notes:
> 
> - I ran the same test with an Adaptec RAID controller as well,
> which gave the same performance profile.

It should.

> - I ran the same test with an ext4 filesystem just to see if it
> gave the same performance profile. It did not - the performance
> slowly degraded over time before a quick dropoff as the disk
> reached max capacity. I expected a different profile, but just
> wanted to run something to make sure that would be the case.

Also as expected. ext4 fills from the outer edge to the inner edge -
it does not segregate directories to different regions of the
filesystem like XFS does.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: RAID6 + XFS + Application Workload write performance question
  2017-10-30 16:43 RAID6 + XFS + Application Workload write performance question Kyle Ames
  2017-10-30 21:39 ` Dave Chinner
@ 2017-10-31 11:06 ` Emmanuel Florac
  1 sibling, 0 replies; 3+ messages in thread
From: Emmanuel Florac @ 2017-10-31 11:06 UTC (permalink / raw)
  To: Kyle Ames; +Cc: linux-xfs

[-- Attachment #1: Type: text/plain, Size: 1501 bytes --]

On Mon, 30 Oct 2017 16:43:15 +0000,
Kyle Ames <kyle.ames@FireEye.com> wrote:

> OS: CentOS 7.3
> Kernel: 3.10.0-693.2.2.el7.x86_64
> RAID: LSI RAID controller - RAID 6 with 10 disks - Strip Size 128
> (and thus a stripe size of 1MB if I understand correctly)
> LVM: One PV, VG, and LV are built on top of the RAID6. The output is below.
> 
> RAID output:
> -----------------------------------------------------------------
> DG/VD TYPE  State Access Consist Cache Cac sCC      Size Name
> -----------------------------------------------------------------
> 0/0   RAID6 Optl  RW     No      RAWBD -   ON  72.761 TB DATA
> 
> ----------------------------------------------------------------------------
> EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp Type
> ----------------------------------------------------------------------------
> 18:0     35 Onln   0 9.094 TB SAS  HDD N   N  512B ST10000NM0096    U  -
> 18:1     39 Onln   0 9.094 TB SAS  HDD N   N  512B ST10000NM0096    U  -

Please note that 10TB disks use 4k physical sectors. It's not the
culprit here, but running them as 512-byte emulated (512e) definitely
isn't optimal for RAID performance.

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


