* ENOSPC on a 10% used disk
@ 2018-10-17 7:52 Avi Kivity
2018-10-17 8:47 ` Christoph Hellwig
` (3 more replies)
0 siblings, 4 replies; 26+ messages in thread
From: Avi Kivity @ 2018-10-17 7:52 UTC (permalink / raw)
To: linux-xfs
I have a user running a 1.7TB filesystem with ~10% usage (as shown by
df), getting sporadic ENOSPC errors. The disk is mounted with inode64
and has a relatively small number of large files. The disk is a
single-member RAID0 array, with 1MB chunk size. There are 32 AGs.
Running Linux 4.9.17.
The write load consists of AIO/DIO writes, followed by unlinks of these
files. The writes are non-size-changing (we truncate ahead) and we use
XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of 32MB. The errors
happen on commit logs, which have a target size of 32MB (but may exceed
it a little).
The errors are sporadic and after restarting the workload they go away
for a few hours to a few days, but then return. During one of the
crashes I used xfs_db to look at fragmentation and saw that most AGs had
free extents of size categories up to 128-255, but a few had more. I
tried xfs_fsr but it did not help.
Is this a known issue? Would upgrading the kernel help?
I'll try to get a metadata dump next time this happens, and I'll be
happy to supply more information.
* Re: ENOSPC on a 10% used disk
2018-10-17 7:52 ENOSPC on a 10% used disk Avi Kivity
@ 2018-10-17 8:47 ` Christoph Hellwig
2018-10-17 8:57 ` Avi Kivity
2018-10-18 1:37 ` Dave Chinner
` (2 subsequent siblings)
3 siblings, 1 reply; 26+ messages in thread
From: Christoph Hellwig @ 2018-10-17 8:47 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs
On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
> I have a user running a 1.7TB filesystem with ~10% usage (as shown by df),
> getting sporadic ENOSPC errors. The disk is mounted with inode64 and has a
> relatively small number of large files. The disk is a single-member RAID0
> array, with 1MB chunk size. There are 32 AGs. Running Linux 4.9.17.
4.9.17 is rather old and you'll have a hard time finding someone
familiar with it..
> Is this a known issue? Would upgrading the kernel help?
A few things that come to mind:
- are you sure there is no open fd to the unlinked files? That would
keep the space allocated until the last link is dropped.
- even once we drop the inode the space only becomes available once
the transaction has committed. We do force the log if we found
a busy extent, but there might be some issues. Try seeing if you
hit the xfs_extent_busy_force trace point with your workload.
- if you have online discard (-o discard) enabled there might be
more issues like the above, especially on old kernels.
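The tracepoint mentioned above can be watched from tracefs (a sketch; requires root, and on a 4.9 kernel the tracing directory is usually under /sys/kernel/debug rather than /sys/kernel/tracing):

```shell
# Enable the xfs_extent_busy_force tracepoint and watch for hits while
# the workload runs (no output means the tracepoint never fired).
cd /sys/kernel/debug/tracing            # or /sys/kernel/tracing
echo 1 > events/xfs/xfs_extent_busy_force/enable
cat trace_pipe
```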
* Re: ENOSPC on a 10% used disk
2018-10-17 8:47 ` Christoph Hellwig
@ 2018-10-17 8:57 ` Avi Kivity
2018-10-17 10:54 ` Avi Kivity
0 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2018-10-17 8:57 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-xfs
On 17/10/2018 11.47, Christoph Hellwig wrote:
> On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
>> I have a user running a 1.7TB filesystem with ~10% usage (as shown by df),
>> getting sporadic ENOSPC errors. The disk is mounted with inode64 and has a
>> relatively small number of large files. The disk is a single-member RAID0
>> array, with 1MB chunk size. There are 32 AGs. Running Linux 4.9.17.
> 4.9.17 is rather old and you'll have a hard time finding someone
> familiar with it..
Yes. I expect my user will agree to upgrade, but I'd like to recommend
this only if we know there was a real issue and it was resolved, not on
general principles.
>> Is this a known issue? Would upgrading the kernel help?
> A few things that come to mind:
>
> - are you sure there is no open fd to the unlinked files? That would
> keep the space allocated until the last link is dropped.
"df" would report that space as occupied, no?
I believe a colleague verified there were no deleted files but I'm not
100% sure.
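One way to verify that first point: scan /proc for descriptors pointing at unlinked files (a sketch; reading other processes' fd links generally needs root, and df does count this space as used until the last descriptor closes):

```shell
#!/bin/sh
# List open-but-deleted files: their blocks stay allocated (and count
# as used in df) until the last descriptor is closed.
for fd in /proc/[0-9]*/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case "$target" in
        *' (deleted)') echo "$fd -> $target" ;;
    esac
done
```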
> - even once we drop the inode the space only becomes available once
> the transaction has committed. We do force the log if we found
> a busy extent, but there might be some issues. Try seeing if you
> hit the xfs_extent_busy_force trace point with your workload.
I'll ask permission to check this and report.
> - if you have online discard (-o discard) enabled there might be
> more issues like the above, especially on old kernels.
Online discard is not enabled:
/dev/md0 on /var/lib/scylla type xfs
(rw,noatime,attr2,inode64,sunit=2048,swidth=2048,noquota)
btw, we've seen fstrim on an old disk (that was likely never trimmed)
improving its performance by a factor of ~100, so my interest in -o
discard is re-awakening. Is it good enough now to run on aio
workloads (assuming nvme) or is more work needed? My prime concern is to
avoid io_submit sleeping.
* Re: ENOSPC on a 10% used disk
2018-10-17 8:57 ` Avi Kivity
@ 2018-10-17 10:54 ` Avi Kivity
0 siblings, 0 replies; 26+ messages in thread
From: Avi Kivity @ 2018-10-17 10:54 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-xfs
On 17/10/2018 11.57, Avi Kivity wrote:
>
>> - even once we drop the inode the space only becomes available once
>> the transaction has committed. We do force the log if we found
>> a busy extent, but there might be some issues. Try seeing if you
>> hit the xfs_extent_busy_force trace point with your workload.
>
>
> I'll ask permission to check this and report.
>
>
An hour's tracing yielded zero hits. Of course, that says nothing about
other times, I'll continue to trace.
* Re: ENOSPC on a 10% used disk
2018-10-17 7:52 ENOSPC on a 10% used disk Avi Kivity
2018-10-17 8:47 ` Christoph Hellwig
@ 2018-10-18 1:37 ` Dave Chinner
2018-10-18 7:55 ` Avi Kivity
2018-10-18 15:54 ` Eric Sandeen
2019-02-05 21:48 ` Dave Chinner
3 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2018-10-18 1:37 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs
On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
> I have a user running a 1.7TB filesystem with ~10% usage (as shown
> by df), getting sporadic ENOSPC errors. The disk is mounted with
> inode64 and has a relatively small number of large files. The disk
> is a single-member RAID0 array, with 1MB chunk size. There are 32
> AGs. Running Linux 4.9.17.
ENOSPC on what operation? write? open(O_CREAT)? something else?
What's the filesystem config (xfs_info output)?
> The write load consists of AIO/DIO writes, followed by unlinks of
> these files. The writes are non-size-changing (we truncate ahead)
> and we use XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of
> 32MB. The errors happen on commit logs, which have a target size of
> 32MB (but may exceed it a little).
>
>
> The errors are sporadic and after restarting the workload they go
> away for a few hours to a few days, but then return. During one of
> the crashes I used xfs_db to look at fragmentation and saw that most
> AGs had free extents of size categories up to 128-255, but a few had
> more. I tried xfs_fsr but it did not help.
32MB extents are 8192 blocks. The bucket 128-255 records extents
between 512k and 1MB in size, so it sounds like free space has been
fragmented to death. Has xfs_fsr been run on this filesystem
regularly?
If the ENOSPC errors are only from files with a 32MB extent size
hints on them, then it may be that there isn't sufficient contiguous
free space to allocate an entire 32MB extent. I'm not sure what the
allocator behaviour here is (the code is a maze of twisty passages),
so I'll have to look more into this.
In the mean time, can you post the output of the freespace command
(both global and per-ag) so we can see just how much free space
there is and how badly fragmented it has become? I might be able to
reproduce the behaviour if I know the conditions under which it is
occurring.
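For reference, those histograms can be gathered with xfs_db in read-only mode (a sketch; the device name is illustrative, and the per-AG output uses freesp's -a flag):

```shell
DEV=/dev/md0                                    # illustrative device
xfs_db -r -c "freesp -s" "$DEV"                 # global histogram + summary
agcount=$(xfs_db -r -c sb -c "p agcount" "$DEV" | awk '{print $3}')
for ag in $(seq 0 $((agcount - 1))); do
    echo "=== AG $ag ==="
    xfs_db -r -c "freesp -a $ag" "$DEV"         # per-AG histogram
done
```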
> Is this a known issue? Would upgrading the kernel help?
Not that I know of. If it's an extszhint vs free space fragmentation
issue, then a kernel upgrade is unlikely to fix it.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: ENOSPC on a 10% used disk
2018-10-18 1:37 ` Dave Chinner
@ 2018-10-18 7:55 ` Avi Kivity
2018-10-18 10:05 ` Dave Chinner
0 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2018-10-18 7:55 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On 18/10/2018 04.37, Dave Chinner wrote:
> On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
>> I have a user running a 1.7TB filesystem with ~10% usage (as shown
>> by df), getting sporadic ENOSPC errors. The disk is mounted with
>> inode64 and has a relatively small number of large files. The disk
>> is a single-member RAID0 array, with 1MB chunk size. There are 32
>> AGs. Running Linux 4.9.17.
> ENOSPC on what operation? write? open(O_CREAT)? something else?
Unknown.
> What's the filesystem config (xfs_info output)?
(restored from metadata dump)
meta-data=/dev/loop2             isize=512    agcount=32, agsize=14494720 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=463831040, imaxpct=5
         =                       sunit=256    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=226480, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
>> The write load consists of AIO/DIO writes, followed by unlinks of
>> these files. The writes are non-size-changing (we truncate ahead)
>> and we use XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of
>> 32MB. The errors happen on commit logs, which have a target size of
>> 32MB (but may exceed it a little).
>>
>>
>> The errors are sporadic and after restarting the workload they go
>> away for a few hours to a few days, but then return. During one of
>> the crashes I used xfs_db to look at fragmentation and saw that most
>> AGs had free extents of size categories up to 128-255, but a few had
>> more. I tried xfs_fsr but it did not help.
> 32MB extents are 8192 blocks. The bucket 128-255 records extents
> between 512k and 1MB in size, so it sounds like free space has been
> fragmented to death. Has xfs_fsr been run on this filesystem
> regularly?
xfs_fsr has never been run, until we saw the problem (and then did not
fix it). IIUC the workload should be self-defragmenting: it consists of
writing large files, then erasing them. I estimate that around 100 files
are written concurrently (from 14 threads), and they are written with
large extent hints. With every large file, another smaller (but still
large) file is written, and a few smallish metadata files.
I understood from xfs_fsr that it attempts to defragment files, not free
space, although that may come as a side effect. In any case I ran xfs_db
after xfs_fsr and did not see an improvement.
>
> If the ENOSPC errors are only from files with a 32MB extent size
> hints on them, then it may be that there isn't sufficient contiguous
> free space to allocate an entire 32MB extent. I'm not sure what the
> allocator behaviour here is (the code is a maze of twisty passages),
> so I'll have to look more into this.
There are other files with 32MB hints that do not show the error (but on
the other hand, the error has been observed few enough times for that to
be a fluke).
>
> In the mean time, can you post the output of the freespace command
> (both global and per-ag) so we can see just how much free space
> there is and how badly fragmented it has become? I might be able to
> reproduce the behaviour if I know the conditions under which it is
> occurring.
xfs_db> freesp
from to extents blocks pct
1 1 5916 5916 0.00
2 3 10235 22678 0.01
4 7 12251 66829 0.02
8 15 5521 59556 0.01
16 31 5703 132031 0.03
32 63 9754 463825 0.11
64 127 16742 1590339 0.37
128 255 1550511 390108625 89.87
256 511 71516 29178504 6.72
512 1023 19 15355 0.00
1024 2047 287 461824 0.11
2048 4095 528 1611413 0.37
4096 8191 1537 10352304 2.38
8192 16383 2 19015 0.00
Just 2 extents >= 32MB (and they may have been freed after the error).
Per-ag:
from to extents blocks pct
1 1 390 390 0.00
2 3 542 1215 0.01
4 7 590 3211 0.02
8 15 265 2735 0.02
16 31 219 5000 0.04
32 63 323 15530 0.11
64 127 620 58217 0.43
128 255 48677 12254686 90.27
256 511 2981 1234365 9.09
from to extents blocks pct
1 1 542 542 0.00
2 3 646 1495 0.01
4 7 592 3122 0.02
8 15 525 5937 0.04
16 31 539 12280 0.09
32 63 691 33226 0.25
64 127 851 78277 0.59
128 255 46390 11658684 88.21
256 511 3335 1422955 10.77
from to extents blocks pct
1 1 560 560 0.00
2 3 642 1454 0.01
4 7 483 2552 0.02
8 15 368 4020 0.03
16 31 440 9947 0.08
32 63 540 25347 0.21
64 127 733 67944 0.56
128 255 42337 10632366 87.06
256 511 3386 1438609 11.78
512 1023 5 4423 0.04
1024 2047 5 8649 0.07
2048 4095 3 9205 0.08
4096 8191 1 8191 0.07
from to extents blocks pct
1 1 662 662 0.01
2 3 675 1545 0.02
4 7 490 2483 0.03
8 15 414 4485 0.05
16 31 445 9915 0.11
32 63 540 25279 0.29
64 127 683 63014 0.72
128 255 10061 2483774 28.34
256 511 1498 574685 6.56
512 1023 9 6715 0.08
1024 2047 5 6967 0.08
2048 4095 100 354101 4.04
4096 8191 786 5229818 59.68
from to extents blocks pct
1 1 642 642 0.01
2 3 705 1599 0.02
4 7 545 2801 0.04
8 15 407 4320 0.05
16 31 410 9396 0.12
32 63 513 24294 0.31
64 127 528 48217 0.61
128 255 2723 644939 8.17
256 511 875 326064 4.13
512 1023 5 4217 0.05
1024 2047 277 446208 5.65
2048 4095 425 1248107 15.81
4096 8191 750 5114295 64.79
8192 16383 2 19015 0.24
from to extents blocks pct
1 1 176 176 0.00
2 3 484 1228 0.01
4 7 825 4277 0.03
8 15 73 870 0.01
16 31 174 4155 0.03
32 63 356 16746 0.12
64 127 597 58761 0.42
128 255 55401 13814803 99.38
from to extents blocks pct
1 1 182 182 0.00
2 3 212 444 0.00
4 7 32 188 0.00
8 15 58 692 0.00
16 31 102 2369 0.02
32 63 243 11756 0.08
64 127 449 43271 0.30
128 255 53882 13618288 95.22
256 511 1550 625387 4.37
from to extents blocks pct
1 1 147 147 0.00
2 3 203 426 0.00
4 7 287 1585 0.01
8 15 84 958 0.01
16 31 105 2370 0.02
32 63 243 12073 0.09
64 127 497 47704 0.34
128 255 51847 13080484 94.15
256 511 1897 747986 5.38
from to extents blocks pct
1 1 81 81 0.00
2 3 129 262 0.00
4 7 186 1070 0.01
8 15 148 1781 0.01
16 31 225 5411 0.04
32 63 257 12226 0.09
64 127 492 46230 0.33
128 255 53802 13533984 95.16
256 511 1574 621876 4.37
from to extents blocks pct
1 1 159 159 0.00
2 3 191 398 0.00
4 7 182 1009 0.01
8 15 63 730 0.01
16 31 88 2006 0.01
32 63 191 9044 0.06
64 127 494 46669 0.33
128 255 53441 13451913 94.51
256 511 1850 720941 5.07
from to extents blocks pct
1 1 156 156 0.00
2 3 192 397 0.00
4 7 169 948 0.01
8 15 67 780 0.01
16 31 115 2948 0.02
32 63 272 12564 0.09
64 127 511 49124 0.35
128 255 53339 13427444 94.42
256 511 1866 726347 5.11
from to extents blocks pct
1 1 157 157 0.00
2 3 171 364 0.00
4 7 221 1215 0.01
8 15 45 504 0.00
16 31 116 2628 0.02
32 63 249 11827 0.08
64 127 474 47158 0.33
128 255 53261 13409025 94.35
256 511 1886 738689 5.20
from to extents blocks pct
1 1 142 142 0.00
2 3 181 395 0.00
4 7 323 1753 0.01
8 15 108 1176 0.01
16 31 134 3069 0.02
32 63 260 12055 0.08
64 127 411 39107 0.28
128 255 53197 13389340 94.39
256 511 1877 737582 5.20
from to extents blocks pct
1 1 137 137 0.00
2 3 174 386 0.00
4 7 222 1232 0.01
8 15 93 1012 0.01
16 31 96 2192 0.02
32 63 223 10763 0.08
64 127 493 47665 0.34
128 255 53125 13374075 94.17
256 511 1949 764710 5.38
from to extents blocks pct
1 1 59 59 0.00
2 3 138 309 0.00
4 7 224 1217 0.01
8 15 104 1211 0.01
16 31 138 3352 0.02
32 63 337 16480 0.12
64 127 585 55922 0.39
128 255 53654 13487724 95.05
256 511 1589 623688 4.40
from to extents blocks pct
1 1 121 121 0.00
2 3 264 597 0.00
4 7 706 3907 0.03
8 15 174 1802 0.01
16 31 94 2243 0.02
32 63 228 10806 0.08
64 127 495 47228 0.34
128 255 52078 13106646 93.94
256 511 1953 779417 5.59
from to extents blocks pct
1 1 107 107 0.00
2 3 174 370 0.00
4 7 248 1401 0.01
8 15 115 1318 0.01
16 31 111 2561 0.02
32 63 218 10243 0.07
64 127 443 42493 0.30
128 255 52320 13168357 94.43
256 511 1828 717948 5.15
from to extents blocks pct
1 1 126 126 0.00
2 3 353 793 0.01
4 7 774 4297 0.03
8 15 174 1767 0.01
16 31 129 3135 0.02
32 63 317 14569 0.11
64 127 506 48326 0.35
128 255 51507 12956078 93.58
256 511 2055 815607 5.89
from to extents blocks pct
1 1 118 118 0.00
2 3 207 448 0.00
4 7 299 1694 0.01
8 15 91 960 0.01
16 31 104 2394 0.02
32 63 358 17378 0.12
64 127 497 47351 0.34
128 255 52540 13229046 93.84
256 511 1971 798192 5.66
from to extents blocks pct
1 1 105 105 0.00
2 3 261 571 0.00
4 7 333 1851 0.01
8 15 100 1009 0.01
16 31 137 3323 0.02
32 63 261 12069 0.09
64 127 482 45103 0.32
128 255 51909 13060192 93.20
256 511 2226 889345 6.35
from to extents blocks pct
1 1 111 111 0.00
2 3 221 471 0.00
4 7 243 1341 0.01
8 15 101 1002 0.01
16 31 87 2145 0.02
32 63 265 12987 0.09
64 127 429 41335 0.29
128 255 51818 13031610 92.85
256 511 2312 944418 6.73
from to extents blocks pct
1 1 89 89 0.00
2 3 245 542 0.00
4 7 383 2114 0.02
8 15 107 1117 0.01
16 31 153 3505 0.03
32 63 237 11431 0.08
64 127 489 46582 0.33
128 255 51377 12929850 92.48
256 511 2412 986093 7.05
from to extents blocks pct
1 1 83 83 0.00
2 3 253 536 0.00
4 7 341 1902 0.01
8 15 118 1269 0.01
16 31 137 3201 0.02
32 63 235 11096 0.08
64 127 432 41041 0.30
128 255 51165 12882960 92.73
256 511 2348 951207 6.85
from to extents blocks pct
1 1 63 63 0.00
2 3 263 570 0.00
4 7 427 2392 0.02
8 15 143 1536 0.01
16 31 117 2714 0.02
32 63 217 10510 0.08
64 127 402 38021 0.27
128 255 50857 12803884 91.91
256 511 2583 1071722 7.69
from to extents blocks pct
1 1 69 69 0.00
2 3 302 645 0.00
4 7 343 1884 0.01
8 15 120 1234 0.01
16 31 133 3184 0.02
32 63 215 9971 0.07
64 127 506 49464 0.35
128 255 49778 12542384 89.34
256 511 3333 1429372 10.18
from to extents blocks pct
1 1 62 62 0.00
2 3 300 652 0.00
4 7 432 2413 0.02
8 15 173 1814 0.01
16 31 92 2119 0.02
32 63 253 12006 0.09
64 127 439 43006 0.31
128 255 49809 12539975 89.53
256 511 3298 1403687 10.02
from to extents blocks pct
1 1 52 52 0.00
2 3 283 608 0.00
4 7 253 1382 0.01
8 15 126 1353 0.01
16 31 117 2653 0.02
32 63 226 10856 0.08
64 127 462 43181 0.31
128 255 50799 12805008 90.86
256 511 2899 1228715 8.72
from to extents blocks pct
1 1 53 53 0.00
2 3 322 683 0.00
4 7 473 2658 0.02
8 15 206 2134 0.02
16 31 149 3494 0.03
32 63 251 12271 0.09
64 127 548 52541 0.38
128 255 50353 12685959 91.22
256 511 2753 1146454 8.24
from to extents blocks pct
1 1 46 46 0.00
2 3 309 655 0.00
4 7 373 2108 0.02
8 15 181 1951 0.01
16 31 161 3795 0.03
32 63 270 12433 0.09
64 127 434 41689 0.30
128 255 50963 12821420 91.99
256 511 2604 1054433 7.56
from to extents blocks pct
1 1 121 121 0.00
2 3 357 779 0.01
4 7 337 1825 0.01
8 15 220 2378 0.02
16 31 181 4124 0.03
32 63 297 13987 0.10
64 127 571 53694 0.39
128 255 49880 12560088 91.06
256 511 2792 1155483 8.38
from to extents blocks pct
1 1 235 235 0.00
2 3 439 964 0.01
4 7 448 2445 0.02
8 15 275 2842 0.02
16 31 221 4979 0.04
32 63 332 15967 0.12
64 127 596 56251 0.41
128 255 48484 12208089 89.11
256 511 3341 1408614 10.28
from to extents blocks pct
1 1 163 163 0.00
2 3 397 877 0.01
4 7 467 2552 0.02
8 15 275 2859 0.02
16 31 234 5424 0.04
32 63 336 16035 0.12
64 127 593 55753 0.41
128 255 49737 12515550 91.40
256 511 2695 1093913 7.99
>> Is this a known issue? Would upgrading the kernel help?
> Not that I know of. If it's an extszhint vs free space fragmentation
> issue, then a kernel upgrade is unlikely to fix it.
>
> Cheers,
>
> Dave.
>
* Re: ENOSPC on a 10% used disk
2018-10-18 7:55 ` Avi Kivity
@ 2018-10-18 10:05 ` Dave Chinner
2018-10-18 11:00 ` Avi Kivity
0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2018-10-18 10:05 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs
[ hmmm, there's some whacky utf-8 whitespace characters in the
copy-n-pasted text... ]
On Thu, Oct 18, 2018 at 10:55:18AM +0300, Avi Kivity wrote:
>
> On 18/10/2018 04.37, Dave Chinner wrote:
> >On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
> >>I have a user running a 1.7TB filesystem with ~10% usage (as shown
> >>by df), getting sporadic ENOSPC errors. The disk is mounted with
> >>inode64 and has a relatively small number of large files. The disk
> >>is a single-member RAID0 array, with 1MB chunk size. There are 32
Ok, now I need to know what "single member RAID0 array" means,
because this is clearly related to allocation alignment and I need
to know why the FS was configured the way it was.
It's one disk? Or is it a hardware RAID0 array that presents as a
single lun with a stripe width of 1MB? If so, how many disks are in
it? Is the chunk size the stripe unit (per-disk chunk size) or the
stripe width (all disks get hit by a 1MB IO)?
Or something else?
> >>AGs. Running Linux 4.9.17.
> >ENOSPC on what operation? write? open(O_CREAT)? something else?
>
>
> Unknown.
>
>
> >What's the filesystem config (xfs_info output)?
>
>
> (restored from metadata dump)
>
>
> meta-data=/dev/loop2 isize=512 agcount=32, agsize=14494720 blks
> = sectsz=512 attr=2, projid32bit=1
> = crc=1 finobt=0 spinodes=0 rmapbt=0
> = reflink=0
> data = bsize=4096 blocks=463831040, imaxpct=5
> = sunit=256 swidth=256 blks
sunit=swidth is unusual for a RAID0 array, unless it's hardware RAID
and the array only reports one number to mkfs. Was this chosen by
mkfs, or specifically configured by the user? If specifically
configured, why?
What is important is that it means aligned allocations will be used
for any allocation that is over sunit (1MB) and that's where all the
problems seem to come from.
> naming =version 2 bsize=4096 ascii-ci=0 ftype=1
> log =internal bsize=4096 blocks=226480, version=2
> = sectsz=512 sunit=8 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
>
> > Has xfs_fsr been run on this filesystem
> >regularly?
>
>
> xfs_fsr has never been run, until we saw the problem (and then did
> not fix it). IIUC the workload should be self-defragmenting: it
> consists of writing large files, then erasing them. I estimate that
> around 100 files are written concurrently (from 14 threads), and
> they are written with large extent hints. With every large file,
> another smaller (but still large) file is written, and a few
> smallish metadata files.
Do those smaller files get removed when the big files are removed?
> I understood from xfs_fsr that it attempts to defragment files, not
> free space, although that may come as a side effect. In any case I
> ran xfs_db after xfs_fsr and did not see an improvement.
xfs_fsr takes fragmented files and contiguous free space and turns
it into contiguous files and fragmented free space. You have
fragmented free space, so I needed to know if xfs_fsr was
responsible for that....
> >If the ENOSPC errors are only from files with a 32MB extent size
> >hints on them, then it may be that there isn't sufficient contiguous
> >free space to allocate an entire 32MB extent. I'm not sure what the
> >allocator behaviour here is (the code is a maze of twisty passages),
> >so I'll have to look more into this.
>
> There are other files with 32MB hints that do not show the error
> (but on the other hand, the error has been observed few enough times
> for that to be a fluke).
*nod*
> >In the mean time, can you post the output of the freespace command
> >(both global and per-ag) so we can see just how much free space
> >there is and how badly fragmented it has become? I might be able to
> >reproduce the behaviour if I know the conditions under which it is
> >occuring.
>
>
> xfs_db> freesp
> from to extents blocks pct
> 1 1 5916 5916 0.00
> 2 3 10235 22678 0.01
> 4 7 12251 66829 0.02
> 8 15 5521 59556 0.01
> 16 31 5703 132031 0.03
> 32 63 9754 463825 0.11
> 64 127 16742 1590339 0.37
> 128 255 1550511 390108625 89.87
> 256 511 71516 29178504 6.72
> 512 1023 19 15355 0.00
> 1024 2047 287 461824 0.11
> 2048 4095 528 1611413 0.37
> 4096 8191 1537 10352304 2.38
> 8192 16383 2 19015 0.00
>
> Just 2 extents >= 32MB (and they may have been freed after the error).
Yes, and the vast majority of free space is in lengths between 512kB
and 1020kB. This is what I'd expect if you have large, stripe
aligned allocations interleaved with smaller, sub-stripe unit
allocations.
As an example of behaviour that can lead to this sort of free space
fragmentation, start with 10 stripe units of contiguous free space:
0    1    2    3    4    5    6    7    8    9    10
+----+----+----+----+----+----+----+----+----+----+----+
Now allocate a > stripe unit extent (say 2 units):
0    1    2    3    4    5    6    7    8    9    10
LLLLLLLLLL+----+----+----+----+----+----+----+----+----+
Now allocate a small file A:
0    1    2    3    4    5    6    7    8    9    10
LLLLLLLLLLAA---+----+----+----+----+----+----+----+----+
Now allocate another large extent:
0    1    2    3    4    5    6    7    8    9    10
LLLLLLLLLLAA---LLLLLLLLLL+----+----+----+----+----+----+
After a while, a significant part of your filesystem looks like
this repeating pattern:
0    1    2    3    4    5    6    7    8    9    10
LLLLLLLLLLAA---LLLLLLLLLLBB---LLLLLLLLLLCC---LLLLLLLLLLDD---+
i.e. there are lots of small, isolated sub stripe unit free spaces.
If you now start removing large extents but leaving the small
files behind, you end up with this:
0    1    2    3    4    5    6    7    8    9    10
LLLLLLLLLLAA---+---------BB---LLLLLLLLLLCC---+----+----DD---+
And now we go to allocate a new large+small file pair (M+n)
they'll get laid out like this:
0    1    2    3    4    5    6    7    8    9    10
LLLLLLLLLLAA---MMMMMMMMMMBB---LLLLLLLLLLCC---nn---+----DD---+
See how we lost a large aligned 2MB freespace @ 9 when the small
file "nn" was laid down? repeat this fill and free pattern over and
over again, and eventually it fragments the free space until there's
no large contiguous free spaces left, and large aligned extents can
no longer be allocated.
For this to trigger you need the small files to be larger than 1
stripe unit, but still much smaller than the extent size hint, and
the small files need to hang around as the large files come and go.
> >>Is this a known issue?
The effect and symptom are known: it's a generic large aligned extent vs small unaligned extent
issue, but I've never seen it manifest in a user workload outside of
a very constrained multistream realtime video ingest/playout
workload (i.e. the workload the filestreams allocator was written
for). And before you ask, no, the filestreams allocator does not
solve this problem.
The most common manifestation of this problem has been inode
allocation on filesystems full of small files - inodes are allocated
in large aligned extents compared to small files, and so eventually
the filesystem runs out of large contiguous freespace and inodes
can't be allocated. The sparse inodes mkfs option fixed this by
allowing inodes to be allocated as sparse chunks so they could
interleave into any free space available....
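For reference, sparse inode chunks are a mkfs-time option (a sketch; the device and mount point are illustrative, and the spinodes=0 in the xfs_info output earlier in the thread shows this filesystem predates enabling it):

```shell
mkfs.xfs -i sparse=1 /dev/sdX             # needs xfsprogs >= 4.2, kernel >= 4.2
mount /dev/sdX /mnt
xfs_info /mnt | grep -o 'spinodes=[01]'   # spinodes=1 when enabled
```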
> >>Would upgrading the kernel help?
> >Not that I know of. If it's an extszhint vs free space fragmentation
> >issue, then a kernel upgrade is unlikely to fix it.
Upgrading the kernel won't fix it, because it's an extszhint vs free
space fragmentation issue.
Filesystems that get into this state are generally considered
unrecoverable. Well, you can recover them by deleting everything
from them to reform contiguous free space, but you may as well just
mkfs and restore from backup because it's much, much faster than
waiting for rm -rf....
And, really, I expect that a different filesystem geometry and/or
mount options are going to be needed to avoid getting into this
state again. However, I don't yet know enough about what in the
workload is triggering the allocator behaviour that causes the
issue to say for sure.
Can I get access to the metadump to dig around in the filesystem
directly so I can see how everything has ended up laid out? That
will help me work out what is actually occurring and determine if
mkfs/mount options can address the problem or whether deeper
allocator algorithm changes may be necessary....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: ENOSPC on a 10% used disk
2018-10-18 10:05 ` Dave Chinner
@ 2018-10-18 11:00 ` Avi Kivity
2018-10-18 13:36 ` Avi Kivity
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: Avi Kivity @ 2018-10-18 11:00 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On 18/10/2018 13.05, Dave Chinner wrote:
> [ hmmm, there's some whacky utf-8 whitespace characters in the
> copy-n-pasted text... ]
It's a brave new world out there.
> On Thu, Oct 18, 2018 at 10:55:18AM +0300, Avi Kivity wrote:
>> On 18/10/2018 04.37, Dave Chinner wrote:
>>> On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
>>>> I have a user running a 1.7TB filesystem with ~10% usage (as shown
>>>> by df), getting sporadic ENOSPC errors. The disk is mounted with
>>>> inode64 and has a relatively small number of large files. The disk
>>>> is a single-member RAID0 array, with 1MB chunk size. There are 32
> Ok, now I need to know what "single member RAID0 array" means,
> because this is clearly related to allocation alignment and I need
> to know why the FS was configured the way it was.
It's a Linux RAID device, /dev/md0.
We configure it this way so that it's easy to add storage (okay, the
real reason is probably to avoid special casing one drive).
>
> It's one disk? Or is it a hardware RAID0 array that presents as a
> single lun with a stripe width of 1MB? If so, how many disks are in
> it? Is the chunk size the stripe unit (per-disk chunk size) or the
> stripe width (all disks get hit by a 1MB IO)?
>
> Or something else?
One disk, organized into a Linux RAID device with just one member.
>
>>>> AGs. Running Linux 4.9.17.
>>> ENOSPC on what operation? write? open(O_CREAT)? something else?
>>
>> Unknown.
>>
>>
>>> What's the filesystem config (xfs_info output)?
>>
>> (restored from metadata dump)
>>
>>
>> meta-data=/dev/loop2 isize=512 agcount=32, agsize=14494720 blks
>> = sectsz=512 attr=2, projid32bit=1
>> = crc=1 finobt=0 spinodes=0 rmapbt=0
>> = reflink=0
>> data = bsize=4096 blocks=463831040, imaxpct=5
>> = sunit=256 swidth=256 blks
> sunit=swidth is unusual for a RAID0 array, unless it's hardware RAID
> and the array only reports one number to mkfs. Was this chosen by
> mkfs, or specifically configured by the user? If specifically
> configured, why?
I'm guessing it's because it has one member? I'm guessing the usual is
swidth=sunit*nmembers?
Maybe that configuration confused xfs? Although we've been using it on
many instances.
>
> What is important is that it means aligned allocations will be used
> for any allocation that is over sunit (1MB) and that's where all the
> problems seem to come from.
Do these aligned allocations not fall back to non-aligned allocations if
they fail?
>
>> naming =version 2 bsize=4096 ascii-ci=0 ftype=1
>> log =internal bsize=4096 blocks=226480, version=2
>> = sectsz=512 sunit=8 blks, lazy-count=1
>> realtime =none extsz=4096 blocks=0, rtextents=0
>>
>>> Has xfs_fsr been run on this filesystem
>>> regularly?
>>
>> xfs_fsr has never been run, until we saw the problem (and then did
>> not fix it). IIUC the workload should be self-defragmenting: it
>> consists of writing large files, then erasing them. I estimate that
>> around 100 files are written concurrently (from 14 threads), and
>> they are written with large extent hints. With every large file,
>> another smaller (but still large) file is written, and a few
>> smallish metadata files.
> Do those smaller files get removed when the big files are removed?
Yes. It's more or less like this:
1. Create two big files, with 32MB hints
2. Append to the two files, using 128k AIO/DIO writes. We truncate ahead
so those writes are not size-changing.
3. Truncate those files to their final size, write ~5 much smaller files
using the same pattern
4. A bunch of fdatasyncs, renames, and directory fdatasyncs
5. The two big files get random reads for a random while
6. All files are unlinked (with some rename and directory fdatasyncs so
we can recover if we crash while doing that)
7. Rinse, repeat. The whole thing happens in parallel for similar and
different file sizes and lifetimes.
The commitlog files (for which we've seen the error) are simpler: create
a file with 32MB extent hint, truncate to 32MB size, lots of writes
(which may not all be 128k).
>
>> I understood from xfs_fsr that it attempts to defragment files, not
>> free space, although that may come as a side effect. In any case I
>> ran xfs_db after xfs_fsr and did not see an improvement.
> xfs_fsr takes fragmented files and contiguous free space and turns
> it into contiguous files and fragmented free space. You have
> fragmented free space, so I needed to know if xfs_fsr was
> responsible for that....
I see.
>
>>> If the ENOSPC errors are only from files with a 32MB extent size
>>> hints on them, then it may be that there isn't sufficient contiguous
>>> free space to allocate an entire 32MB extent. I'm not sure what the
>>> allocator behaviour here is (the code is a maze of twisty passages),
>>> so I'll have to look more into this.
>> There are other files with 32MB hints that do not show the error
>> (but on the other hand, the error has been observed few enough times
>> for that to be a fluke).
> *nod*
>
>>> In the mean time, can you post the output of the freespace command
>>> (both global and per-ag) so we can see just how much free space
>>> there is and how badly fragmented it has become? I might be able to
>>> reproduce the behaviour if I know the conditions under which it is
>>> occurring.
>>
>> xfs_db> freesp
>> from to extents blocks pct
>> 1 1 5916 5916 0.00
>> 2 3 10235 22678 0.01
>> 4 7 12251 66829 0.02
>> 8 15 5521 59556 0.01
>> 16 31 5703 132031 0.03
>> 32 63 9754 463825 0.11
>> 64 127 16742 1590339 0.37
>> 128 255 1550511 390108625 89.87
>> 256 511 71516 29178504 6.72
>> 512 1023 19 15355 0.00
>> 1024 2047 287 461824 0.11
>> 2048 4095 528 1611413 0.37
>> 4096 8191 1537 10352304 2.38
>> 8192 16383 2 19015 0.00
>>
>> Just 2 extents >= 32MB (and they may have been freed after the error).
> Yes, and the vast majority of free space is in lengths between 512kB
> and 1020kB. This is what I'd expect if you have large, stripe
> aligned allocations interleaved with smaller, sub-stripe unit
> allocations.
>
> As an example of behaviour that can lead to this sort of free space
> fragmentation, start with 10 stripe units of contiguous free space:
>
> 0    1    2    3    4    5    6    7    8    9    10
> +----+----+----+----+----+----+----+----+----+----+----+
>
> Now allocate a > stripe unit extent (say 2 units):
>
> 0    1    2    3    4    5    6    7    8    9    10
> LLLLLLLLLL+----+----+----+----+----+----+----+----+----+
>
> Now allocate a small file A:
>
> 0    1    2    3    4    5    6    7    8    9    10
> LLLLLLLLLLAA---+----+----+----+----+----+----+----+----+
>
> Now allocate another large extent:
>
> 0    1    2    3    4    5    6    7    8    9    10
> LLLLLLLLLLAA---LLLLLLLLLL+----+----+----+----+----+----+
>
> After a while, a significant part of your filesystem looks like
> this repeating pattern:
>
> 0    1    2    3    4    5    6    7    8    9    10
> LLLLLLLLLLAA---LLLLLLLLLLBB---LLLLLLLLLLCC---LLLLLLLLLLDD---+
>
> i.e. there are lots of small, isolated sub stripe unit free spaces.
> If you now start removing large extents but leaving the small
> files behind, you end up with this:
>
> 0    1    2    3    4    5    6    7    8    9    10
> LLLLLLLLLLAA---+---------BB---LLLLLLLLLLCC---+----+----DD---+
>
> And now we go to allocate a new large+small file pair (M+n)
> they'll get laid out like this:
>
> 0    1    2    3    4    5    6    7    8    9    10
> LLLLLLLLLLAA---MMMMMMMMMMBB---LLLLLLLLLLCC---nn---+----DD---+
>
> See how we lost a large aligned 2MB freespace @ 9 when the small
> file "nn" was laid down? repeat this fill and free pattern over and
> over again, and eventually it fragments the free space until there's
> no large contiguous free spaces left, and large aligned extents can
> no longer be allocated.
>
> For this to trigger you need the small files to be larger than 1
> stripe unit, but still much smaller than the extent size hint, and
> the small files need to hang around as the large files come and go.
This can happen, and indeed I see our default hint is 1MB, so our small
files use a 1MB hint. Looks like we should remove that 1MB hint since
it's reducing allocation flexibility for XFS without a good return. On
the other hand, I worry that because we bypass the page cache, XFS
doesn't get to see the entire file at one time and so it will get
fragmented.
Suppose I write a 4k file with a 1MB hint. How is that trailing (1MB-4k)
marked? Free extent, free extent with extra annotation, or allocated
extent? We may need to deallocate those extents? (will
FALLOC_FL_PUNCH_HOLE do the trick?)
>
>>>> Is this a known issue?
> The effect and symptom is - it's a generic large aligned extent vs small unaligned extent
> issue, but I've never seen it manifest in a user workload outside of
> a very constrained multistream realtime video ingest/playout
> workload (i.e. the workload the filestreams allocator was written
> for). And before you ask, no, the filestreams allocator does not
> solve this problem.
>
> The most common manifestation of this problem has been inode
> allocation on filesystems full of small files - inodes are allocated
> in large aligned extents compared to small files, and so eventually
> the filesystem runs out of large contiguous freespace and inodes
> can't be allocated. The sparse inodes mkfs option fixed this by
> allowing inodes to be allocated as sparse chunks so they could
> interleave into any free space available....
Shouldn't XFS fall back to a non-aligned allocation rather than
returning ENOSPC on a filesystem with 90% free space?
>
>>>> Would upgrading the kernel help?
>>> Not that I know of. If it's an extszhint vs free space fragmentation
>>> issue, then a kernel upgrade is unlikely to fix it.
> Upgrading the kernel won't fix it, because it's an extszhint vs free
> space fragmentation issue.
>
> Filesystems that get into this state are generally considered
> unrecoverable. Well, you can recover them by deleting everything
> from them to reform contiguous free space, but you may as well just
> mkfs and restore from backup because it's much, much faster than
> waiting for rm -rf....
>
> And, really, I expect that a different filesystem geometry and/or
> mount options are going to be needed to avoid getting into this
> state again. However, I don't really know enough yet about what in
> the workload and allocator is triggering to cause the issue to say
> yet.
>
> Can I get access to the metadump to dig around in the filesystem
> directly so I can see how everything has ended up laid out? that
> will help me work out what is actually occurring and determine if
> mkfs/mount options can address the problem or whether deeper
> allocator algorithm changes may be necessary....
I will ask permission to share the dump.
Thanks a lot for all the explanations and help.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk
2018-10-18 11:00 ` Avi Kivity
@ 2018-10-18 13:36 ` Avi Kivity
2018-10-19 7:51 ` Dave Chinner
2018-10-18 15:44 ` Avi Kivity
2018-10-19 1:15 ` Dave Chinner
2 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2018-10-18 13:36 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On 18/10/2018 14.00, Avi Kivity wrote:
>
>> Can I get access to the metadump to dig around in the filesystem
>> directly so I can see how everything has ended up laid out? that
>> will help me work out what is actually occurring and determine if
>> mkfs/mount options can address the problem or whether deeper
>> allocator algorithm changes may be necessary....
>
>
> I will ask permission to share the dump.
>
>
>
I'll send you a link privately.
* Re: ENSOPC on a 10% used disk
2018-10-18 11:00 ` Avi Kivity
2018-10-18 13:36 ` Avi Kivity
@ 2018-10-18 15:44 ` Avi Kivity
2018-10-18 16:11 ` Avi Kivity
2018-10-19 1:24 ` Dave Chinner
2018-10-19 1:15 ` Dave Chinner
2 siblings, 2 replies; 26+ messages in thread
From: Avi Kivity @ 2018-10-18 15:44 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On 18/10/2018 14.00, Avi Kivity wrote:
>
>
> This can happen, and indeed I see our default hint is 1MB, so our
> small files use a 1MB hint. Looks like we should remove that 1MB hint
> since it's reducing allocation flexibility for XFS without a good return.
I convinced myself that this is the root cause; it fits perfectly with
your explanation. I still think that XFS should allocate *something*
rather than ENOSPC, but I can also understand someone wanting a guarantee.
> On the other hand, I worry that because we bypass the page cache, XFS
> doesn't get to see the entire file at one time and so it will get
> fragmented.
That's what happens. I issued 1000 4k writes to each of 400 files, in
parallel, AIO+DIO. I got 400 perfectly-fragmented files; each had 1000
extents.
So I'll remove the default hint for small files, and replace it with
larger buffer sizes so we batch more and don't get 8k-sized extents
(which is our default buffer size).
>
>
> Suppose I write a 4k file with a 1MB hint. How is that trailing
> (1MB-4k) marked? Free extent, free extent with extra annotation, or
> allocated extent? We may need to deallocate those extents? (will
> FALLOC_FL_PUNCH_HOLE do the trick?)
>
I found an 11-year-old post from you that says those reservations are
freed on close:
https://linux-xfs.oss.sgi.narkive.com/Bpctu4DN/reducing-memory-requirements-for-high-extent-xfs-files#post6
This is consistent with xfs_db reporting those areas are free.
* Re: ENSOPC on a 10% used disk
2018-10-17 7:52 ENSOPC on a 10% used disk Avi Kivity
2018-10-17 8:47 ` Christoph Hellwig
2018-10-18 1:37 ` Dave Chinner
@ 2018-10-18 15:54 ` Eric Sandeen
2018-10-21 11:49 ` Avi Kivity
2019-02-05 21:48 ` Dave Chinner
3 siblings, 1 reply; 26+ messages in thread
From: Eric Sandeen @ 2018-10-18 15:54 UTC (permalink / raw)
To: Avi Kivity, linux-xfs
On 10/17/18 2:52 AM, Avi Kivity wrote:
> I have a user running a 1.7TB filesystem with ~10% usage (as shown by df), getting sporadic ENOSPC errors. The disk is mounted with inode64 and has a relatively small number of large files. The disk is a single-member RAID0 array, with 1MB chunk size. There are 32 AGs. Running Linux 4.9.17.
>
>
> The write load consists of AIO/DIO writes, followed by unlinks of these files. The writes are non-size-changing (we truncate ahead) and we use XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of 32MB. The errors happen on commit logs, which have a target size of 32MB (but may exceed it a little).
>
>
> The errors are sporadic and after restarting the workload they go away for a few hours to a few days, but then return. During one of the crashes I used xfs_db to look at fragmentation and saw that most AGs had free extents of size categories up to 128-255, but a few had more. I tried xfs_fsr but it did not help.
>
>
> Is this a known issue? Would upgrading the kernel help?
>
>
> I'll try to get a metadata dump next time this happens, and I'll be happy to supply more information.
It sounds like you all figured this out, but I'll drop a reference to
One Weird Trick to figure out just what function is returning a specific
error value (the example below is EINVAL)
First is my hack, what follows was Dave's refinement. We should get this
into scripts/ some day.
> # for FUNCTION in `grep "t xfs_" /proc/kallsyms | awk '{print $3}'`; do echo "r:ret_$FUNCTION $FUNCTION \$retval" >> /sys/kernel/debug/tracing/kprobe_events; done
>
> # for ENABLE in /sys/kernel/debug/tracing/events/kprobes/ret_xfs_*/enable; do echo 1 > $ENABLE; done
>
> run a test that fails:
>
> # dd if=/dev/zero of=newfile bs=513 oflag=direct
> dd: writing `newfile': Invalid argument
>
> # for ENABLE in /sys/kernel/debug/tracing/events/kprobes/ret_xfs_*/enable; do echo 0 > $ENABLE; done
>
> # cat /sys/kernel/debug/tracing/trace
> <snip>
> <...>-63791 [000] d... 705435.568913: ret_xfs_vn_mknod: (xfs_vn_create+0x13/0x20 [xfs] <- xfs_vn_mknod) arg1=0
> <...>-63791 [000] d... 705435.568913: ret_xfs_vn_create: (vfs_create+0xdb/0x100 <- xfs_vn_create) arg1=0
> <...>-63791 [000] d... 705435.568918: ret_xfs_file_open: (do_dentry_open+0x24e/0x2e0 <- xfs_file_open) arg1=0
> <...>-63791 [000] d... 705435.568934: ret_xfs_file_dio_aio_write: (xfs_file_aio_write+0x147/0x150 [xfs] <- xfs_file_dio_aio_write) arg1=ffffffffffffffea
>
> Hey look, it's "-22" in hex!
>
> so it's possible, but bleah.
Dave later refined that to:
> #!/bin/bash
>
> TRACEDIR=/sys/kernel/debug/tracing
>
> grep -i 't xfs_' /proc/kallsyms | awk '{print $3}' | while read F; do
> echo "r:ret_$F $F \$retval" >> $TRACEDIR/kprobe_events
> done
>
> for E in $TRACEDIR/events/kprobes/ret_xfs_*/enable; do
> echo 1 > $E
> done;
>
> echo 'arg1 > 0xffffffffffffff00' > $TRACEDIR/events/kprobes/filter
>
> for T in $TRACEDIR/events/kprobes/ret_xfs_*/trigger; do
> echo 'traceoff if arg1 > 0xffffffffffffff00' > $T
> done
> And that gives:
>
> # dd if=/dev/zero of=/mnt/scratch/newfile bs=513 oflag=direct
> dd: error writing '/mnt/scratch/newfile': Invalid argument
> 1+0 records in
> 0+0 records out
> 0 bytes (0 B) copied, 0.000259882 s, 0.0 kB/s
> root@test4:~# cat /sys/kernel/debug/tracing/trace
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 1/1 #P:16
> #
> # _-----=> irqs-off
> # / _----=> need-resched
> # | / _---=> hardirq/softirq
> # || / _--=> preempt-depth
> # ||| / delay
> # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> # | | | |||| | |
> <...>-8073 [006] d... 145740.460546: ret_xfs_file_dio_aio_write: (xfs_file_aio_write+0x170/0x180 <- xfs_file_dio_aio_write) arg1=0xffffffffffffffea
>
> Which is precisely the detection that XFS_ERROR would have given us.
> Ok, so I guess we can now add whatever we need to that trigger...
>
> Basically, pass in the XFS function names you want to trace, the
> script sets up the events and whatever trigger behaviour you want, and
> we're off to the races...
* Re: ENSOPC on a 10% used disk
2018-10-18 15:44 ` Avi Kivity
@ 2018-10-18 16:11 ` Avi Kivity
2018-10-19 1:24 ` Dave Chinner
1 sibling, 0 replies; 26+ messages in thread
From: Avi Kivity @ 2018-10-18 16:11 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On 18/10/2018 18.44, Avi Kivity wrote:
>
> On 18/10/2018 14.00, Avi Kivity wrote:
>>
>>
>> This can happen, and indeed I see our default hint is 1MB, so our
>> small files use a 1MB hint. Looks like we should remove that 1MB hint
>> since it's reducing allocation flexibility for XFS without a good
>> return.
>
>
> I convinced myself that this is the root cause, it fits perfectly with
> your explanation. I still think that XFS should allocate *something*
> rather than ENOSPC, but I can also understand someone wanting a
> guarantee.
>
A small twist: there were in fact lots of small files on that system,
caused by snapshots that the user did not remove. But I think the
explanation still holds.
* Re: ENSOPC on a 10% used disk
2018-10-18 11:00 ` Avi Kivity
2018-10-18 13:36 ` Avi Kivity
2018-10-18 15:44 ` Avi Kivity
@ 2018-10-19 1:15 ` Dave Chinner
2018-10-21 9:21 ` Avi Kivity
2 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2018-10-19 1:15 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs
On Thu, Oct 18, 2018 at 02:00:19PM +0300, Avi Kivity wrote:
> On 18/10/2018 13.05, Dave Chinner wrote:
> >On Thu, Oct 18, 2018 at 10:55:18AM +0300, Avi Kivity wrote:
> >>On 18/10/2018 04.37, Dave Chinner wrote:
> >>>On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
> >>>>I have a user running a 1.7TB filesystem with ~10% usage (as shown
> >>>>by df), getting sporadic ENOSPC errors. The disk is mounted with
> >>>>inode64 and has a relatively small number of large files. The disk
> >>>>is a single-member RAID0 array, with 1MB chunk size. There are 32
> >Ok, now I need to know what "single member RAID0 array" means,
> >becuase this is clearly related to allocation alignment and I need
> >to know why the FS was configured the way it was.
>
>
> It's a Linux RAID device, /dev/md0.
>
>
> We configure it this way so that it's easy to add storage (okay, the
> real reason is probably to avoid special casing one drive).
As a stripe? That requires resilvering to expand, which is a slow,
messy operation. There have also been too many horror stories about
crashes during resilvering causing unrecoverable corruptions for my
liking...
> One disk, organized into a Linux RAID device with just one member.
So there's no real need for IO alignment at all. Unaligned writes
to RAID0 don't require RMW cycles, so alignment is really only used
to avoid hotspotting a disk in the stripe. Which isn't an issue
here, either.
> >>meta-data=/dev/loop2 isize=512 agcount=32, agsize=14494720 blks
> >> = sectsz=512 attr=2, projid32bit=1
> >> = crc=1 finobt=0 spinodes=0 rmapbt=0
> >> = reflink=0
> >>data = bsize=4096 blocks=463831040, imaxpct=5
> >> = sunit=256 swidth=256 blks
> >sunit=swidth is unusual for a RAID0 array, unless it's hardware RAID
> >and the array only reports one number to mkfs. What this chosen by
> >mkfs, or specifically configured by the user? If specifically
> >configured, why?
>
>
> I'm guessing it's because it has one member? I'm guessing the usual
> is swidth=sunit*nmembers?
*nod*. Which is unusual for a RAID0 device.
> >What is important is that it means aligned allocations will be used
> >for any allocation that is over sunit (1MB) and that's where all the
> >problems seem to come from.
>
> Do these aligned allocations not fall back to non-aligned
> allocations if they fail?
They do, but extent size hints change the fallback behaviour...
> >See how we lost a large aligned 2MB freespace @ 9 when the small
> >file "nn" was laid down? repeat this fill and free pattern over and
> >over again, and eventually it fragments the free space until there's
> >no large contiguous free spaces left, and large aligned extents can
> >no longer be allocated.
> >
> >For this to trigger you need the small files to be larger than 1
> >stripe unit, but still much smaller than the extent size hint, and
> >the small files need to hang around as the large files come and go.
>
>
> This can happen, and indeed I see our default hint is 1MB, so our
> small files use a 1MB hint.
Ok, which forces all allocations to be at least stripe unit (1MB)
aligned.
>
> Looks like we should remove that 1MB
> hint since it's reducing allocation flexibility for XFS without a
> good return. On the other hand, I worry that because we bypass the
> page cache, XFS doesn't get to see the entire file at one time and
> so it will get fragmented.
Yes. Your other option is to use an extent size hint that is smaller
than the sunit. That should not align to 1MB because the initial
data allocation size is not large enough to trigger stripe
alignment.
> Suppose I write a 4k file with a 1MB hint. How is that trailing
> (1MB-4k) marked? Free extent, free extent with extra annotation, or
> allocated extent? We may need to deallocate those extents? (will
> FALLOC_FL_PUNCH_HOLE do the trick?)
It's an unwritten extent beyond EOF, and how that is treated when
the file is last closed depends on how that extent was allocated.
But, yes, punching the range beyond EOF will definitely free it.
> >>>>Is this a known issue?
> >The effect and symptom is - it's a generic large aligned extent vs small unaligned extent
> >issue, but I've never seen it manifest in a user workload outside of
> >a very constrained multistream realtime video ingest/playout
> >workload (i.e. the workload the filestreams allocator was written
> >for). And before you ask, no, the filestreams allocator does not
> >solve this problem.
> >
> >The most common manifestation of this problem has been inode
> >allocation on filesystems full of small files - inodes are allocated
> >in large aligned extents compared to small files, and so eventually
> the filesystem runs out of large contiguous freespace and inodes
> >can't be allocated. The sparse inodes mkfs option fixed this by
> >allowing inodes to be allocated as sparse chunks so they could
> >interleave into any free space available....
>
> Shouldn't XFS fall back to a non-aligned allocation rather than
> returning ENOSPC on a filesystem with 90% free space?
The filesystem does fall back to unaligned allocation - there are ~5
separate, progressively less strict allocation attempts on failure.
The problem is that the extent size hint is asking to allocate a
contiguous 32MB extent and there's no contiguous 32MB free space
extent available, aligned or not. That's what I think is generating
the ENOSPC error, but it's not clear to me from the code whether it
is supposed to ignore the extent size hint on failure and allocate a
set of shorter unaligned extents or not....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: ENSOPC on a 10% used disk
2018-10-18 15:44 ` Avi Kivity
2018-10-18 16:11 ` Avi Kivity
@ 2018-10-19 1:24 ` Dave Chinner
2018-10-21 9:00 ` Avi Kivity
1 sibling, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2018-10-19 1:24 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs
On Thu, Oct 18, 2018 at 06:44:54PM +0300, Avi Kivity wrote:
>
> On 18/10/2018 14.00, Avi Kivity wrote:
> >
> >
> >This can happen, and indeed I see our default hint is 1MB, so our
> >small files use a 1MB hint. Looks like we should remove that 1MB
> >hint since it's reducing allocation flexibility for XFS without a
> >good return.
>
>
> I convinced myself that this is the root cause, it fits perfectly
> with your explanation. I still think that XFS should allocate
> *something* rather than ENOSPC, but I can also understand someone
> wanting a guarantee.
Yup, it's a classic catch 22.
> >On the other hand, I worry that because we bypass the page cache,
> >XFS doesn't get to see the entire file at one time and so it will
> >get fragmented.
>
>
> That's what happens. I write 1000 4k writes to 400 files, in
> parallel, AIO+DIO. I got 400 perfectly-fragmented files, each had
> 1000 extents.
Yup, you wrote them all in the one directory, didn't you? :)
> So I'll remove the default hint for small files, and replace it with
> larger buffer sizes so we batch more and don't get 8k-sized extents
> (which is our default buffer size).
Or you could just mount with the "noalign" mount option to turn off
stripe alignment. After all, you don't need stripe alignment for a
single spindle....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: ENSOPC on a 10% used disk
2018-10-18 13:36 ` Avi Kivity
@ 2018-10-19 7:51 ` Dave Chinner
2018-10-21 8:55 ` Avi Kivity
0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2018-10-19 7:51 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs
On Thu, Oct 18, 2018 at 04:36:42PM +0300, Avi Kivity wrote:
> On 18/10/2018 14.00, Avi Kivity wrote:
> >>Can I get access to the metadump to dig around in the filesystem
> >>directly so I can see how everything has ended up laid out? that
> >>will help me work out what is actually occurring and determine if
> >>mkfs/mount options can address the problem or whether deeper
> >>allocator algorithm changes may be necessary....
> >
> >I will ask permission to share the dump.
>
> I'll send you a link privately.
Thanks - I've started looking at this - the information here is
just layout stuff - I've omitted filenames and anything else that
might be identifying from the output.
Looking at a commit log file:
stat.size = 33554432
stat.blocks = 34720
fsxattr.xflags = 0x800 [----------e-----]
fsxattr.projid = 0
fsxattr.extsize = 33554432
fsxattr.cowextsize = 0
fsxattr.nextents = 14
and the layout:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..4079]: 2646677520..2646681599 22 (95606800..95610879) 4080 001010
1: [4080..8159]: 2643130384..2643134463 22 (92059664..92063743) 4080 001010
2: [8160..12239]: 2642124816..2642128895 22 (91054096..91058175) 4080 001010
3: [12240..16319]: 2640666640..2640670719 22 (89595920..89599999) 4080 001010
4: [16320..18367]: 2640523264..2640525311 22 (89452544..89454591) 2048 000000
5: [18368..20415]: 2640119808..2640121855 22 (89049088..89051135) 2048 000000
6: [20416..21287]: 2639874064..2639874935 22 (88803344..88804215) 872 001111
7: [21288..21295]: 2639874936..2639874943 22 (88804216..88804223) 8 011111
8: [21296..24495]: 2639874944..2639878143 22 (88804224..88807423) 3200 001010
9: [24496..26543]: 2639427584..2639429631 22 (88356864..88358911) 2048 000000
10: [26544..28591]: 2638981120..2638983167 22 (87910400..87912447) 2048 000000
11: [28592..30639]: 2638770176..2638772223 22 (87699456..87701503) 2048 000000
12: [30640..31279]: 2638247952..2638248591 22 (87177232..87177871) 640 001111
13: [31280..34719]: 2638248592..2638252031 22 (87177872..87181311) 3440 011010
14: [34720..65535]: hole 30816
The first thing I note is the initial allocations are just short of
2MB and so the extent size hint is, indeed, being truncated here
according to contiguous free space limitations. I had thought that
should occur from reading the code, but it's complex and I wasn't
100% certain what minimum allocation length would be used.
Looking at the system batchlog files, I'm guessing the filesystem
ran out of contiguous 32MB free space extents some time around
September 25. The *Data.db files from 24 Sep and earlier are
all nice 32MB extents; from 25 Sep onwards they never make the full
32MB (30-31MB max). eg, good:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..65535]: 350524552..350590087 3 (2651272..2716807) 65536 001111
1: [65536..131071]: 353378024..353443559 3 (5504744..5570279) 65536 001111
2: [131072..196607]: 355147016..355212551 3 (7273736..7339271) 65536 001111
3: [196608..262143]: 360029416..360094951 3 (12156136..12221671) 65536 001111
4: [262144..327679]: 362244144..362309679 3 (14370864..14436399) 65536 001111
5: [327680..343415]: 365809456..365825191 3 (17936176..17951911) 15736 001111
bad:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..64127]: 512855496..512919623 4 (49024456..49088583) 64128 001111
1: [64128..128247]: 266567048..266631167 2 (34651528..34715647) 64120 001010
2: [128248..142327]: 264401888..264415967 2 (32486368..32500447) 14080 001111
Hmmm - there's 2 million files in this filesystem. That is quite a
lot...
Ok... I see where all the files are - there's a db that was
snapshotted every half hour going back to December 19 2017. There's
55GB of snapshot data there: 14362 snapshots holding 1.8 million
files.
Ok, now I understand how the filesystem got into this mess. It has
nothing really to do with the filesystem allocator, geometry, extent
size hints, etc. It isn't really even an XFS specific problem - I
think most filesystems would be in trouble if you did this to them.
First, let me demonstrate that the freespace fragmentation is caused
by these snapshots by removing them all:
before:
from to extents blocks pct
1 1 5916 5916 0.00
2 3 10235 22678 0.01
4 7 12251 66829 0.02
8 15 5521 59556 0.01
16 31 5703 132031 0.03
32 63 9754 463825 0.11
64 127 16742 1590339 0.37
128 255 1550511 390108625 89.87
256 511 71516 29178504 6.72
512 1023 19 15355 0.00
1024 2047 287 461824 0.11
2048 4095 528 1611413 0.37
4096 8191 1537 10352304 2.38
8192 16383 2 19015 0.00
Run a delete:
for d in snapshots/*; do
rm -rf $d &
done
<cranking along at ~12,000 write iops>
# uptime
17:41:08 up 22:07, 1 user, load average: 14293.17, 13840.37, 9517.14
#
500,000 files removed:
from to extents blocks pct
64 127 22564 2054234 0.47
128 255 900480 226428059 51.43
256 511 189904 91033237 20.68
512 1023 68304 54958788 12.48
1024 2047 25187 38284024 8.70
2048 4095 5508 15204528 3.45
4096 8191 1665 10999789 2.50
8192 16383 15 139424 0.03
1m files removed:
from to extents blocks pct
64 127 21940 1991685 0.45
128 255 536985 134731402 30.35
256 511 152092 73465972 16.55
512 1023 100471 82971130 18.69
1024 2047 48519 74016490 16.67
2048 4095 17272 49209538 11.09
4096 8191 4307 25135374 5.66
8192 16383 135 1254037 0.28
1.5m files removed:
from to extents blocks pct
64 127 9851 924782 0.20
128 255 227945 57079302 12.32
256 511 38723 18129086 3.91
512 1023 33547 28027554 6.05
1024 2047 31904 50171699 10.83
2048 4095 25263 75381887 16.27
4096 8191 16885 102836365 22.19
8192 16383 6367 68809645 14.85
16384 32767 1862 40183775 8.67
32768 65535 385 16228869 3.50
65536 131071 51 4213237 0.91
131072 262143 6 958528 0.21
after:
from to extents blocks pct
128 255 154063 38785829 8.64
256 511 11037 4942114 1.10
512 1023 8576 6930035 1.54
1024 2047 8496 13464298 3.00
2048 4095 7664 23034455 5.13
4096 8191 8497 55217061 12.31
8192 16383 4233 45867691 10.22
16384 32767 1533 33488995 7.46
32768 65535 520 23924895 5.33
65536 131071 305 28675646 6.39
131072 262143 230 42411732 9.45
262144 524287 98 37213190 8.29
524288 1048575 41 29163579 6.50
1048576 2097151 27 40502889 9.03
2097152 4194303 5 14576157 3.25
4194304 8388607 2 10005670 2.23
Ok, so the result is not perfect, but there are now huge contiguous
free space extents available again - ~70% of the free space is now in
contiguous extents >=32MB in length. There's every chance that the
fs would continue to reform large contiguous free spaces as the
database files come and go now, as long as the snapshot problem is
dealt with.
So, what's the problem? Well, it's simply that the workload is
mixing data with vastly different temporal characteristics in the
same physical locality. Every half an hour, a set of ~100 smallish
files are written into a new directory, which lands them at the low
end of the largest free space extent in that AG. Each new snapshot
directory ends up in a different AG, so it slowly spreads the
snapshots across all the AGs in the filesystem.
Each snapshot effectively appends to the current working area in the
AG, chopping it out of the largest contiguous free space. By the
time the next snapshot in that AG comes around, there's other new
short term data between the old snapshot and the new one. The new
snapshot chops up the largest freespace, and on goes the cycle.
Eventually the short term data between the snapshots gets removed,
but this doesn't reform large contiguous free spaces because the
snapshot data is in the way. And so this cycle continues with the
snapshot data chopping up the largest freespace extents in the
filesystem until there are no more large free space extents to be
found.
The solution is to manage the snapshot data better. We need to keep
all the long term data physically isolated from the short term data
so they don't fragment free space. A short term application level
solution would require migrating the snapshot data out of the
filesystem to somewhere else and point to it with symlinks.
From the filesystem POV, I'm not sure that there is much we can do
about this directly - we have no idea what the lifetime of the data
is going to be....
<ding>
Hold on....
<rummage in code>
....we already have an interface for setting those sorts of hints.
fcntl(F_SET_RW_HINT, rw_hint)
/*
* Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
* used to clear any hints previously set.
*/
#define RWF_WRITE_LIFE_NOT_SET 0
#define RWH_WRITE_LIFE_NONE 1
#define RWH_WRITE_LIFE_SHORT 2
#define RWH_WRITE_LIFE_MEDIUM 3
#define RWH_WRITE_LIFE_LONG 4
#define RWH_WRITE_LIFE_EXTREME 5
Avi, does this sound like something that you could use to
classify the different types of data the database writes out?
I'll need to have a think about how to apply this to the allocator
policy algorithms before going any further, but I suspect making use
of this hint interface will allow us to prevent interleaving of short
and long term data and so avoid the freespace fragmentation it is
causing here....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk
2018-10-19 7:51 ` Dave Chinner
@ 2018-10-21 8:55 ` Avi Kivity
2018-10-21 14:28 ` Dave Chinner
0 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2018-10-21 8:55 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On 19/10/2018 10.51, Dave Chinner wrote:
> On Thu, Oct 18, 2018 at 04:36:42PM +0300, Avi Kivity wrote:
>> On 18/10/2018 14.00, Avi Kivity wrote:
>>>> Can I get access to the metadump to dig around in the filesystem
>>>> directly so I can see how everything has ended up laid out? that
>>>> will help me work out what is actually occurring and determine if
>>>> mkfs/mount options can address the problem or whether deeper
>>>> allocator algorithm changes may be necessary....
>>> I will ask permission to share the dump.
>> I'll send you a link privately.
> Thanks - I've started looking at this - the information here is
> just layout stuff - I've omitted filenames and anything else that
> might be identifying from the output.
>
> Looking at a commit log file:
>
> stat.size = 33554432
> stat.blocks = 34720
> fsxattr.xflags = 0x800 [----------e-----]
> fsxattr.projid = 0
> fsxattr.extsize = 33554432
> fsxattr.cowextsize = 0
> fsxattr.nextents = 14
>
>
> and the layout:
>
> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> 0: [0..4079]: 2646677520..2646681599 22 (95606800..95610879) 4080 001010
> 1: [4080..8159]: 2643130384..2643134463 22 (92059664..92063743) 4080 001010
> 2: [8160..12239]: 2642124816..2642128895 22 (91054096..91058175) 4080 001010
> 3: [12240..16319]: 2640666640..2640670719 22 (89595920..89599999) 4080 001010
> 4: [16320..18367]: 2640523264..2640525311 22 (89452544..89454591) 2048 000000
> 5: [18368..20415]: 2640119808..2640121855 22 (89049088..89051135) 2048 000000
> 6: [20416..21287]: 2639874064..2639874935 22 (88803344..88804215) 872 001111
> 7: [21288..21295]: 2639874936..2639874943 22 (88804216..88804223) 8 011111
> 8: [21296..24495]: 2639874944..2639878143 22 (88804224..88807423) 3200 001010
> 9: [24496..26543]: 2639427584..2639429631 22 (88356864..88358911) 2048 000000
> 10: [26544..28591]: 2638981120..2638983167 22 (87910400..87912447) 2048 000000
> 11: [28592..30639]: 2638770176..2638772223 22 (87699456..87701503) 2048 000000
> 12: [30640..31279]: 2638247952..2638248591 22 (87177232..87177871) 640 001111
> 13: [31280..34719]: 2638248592..2638252031 22 (87177872..87181311) 3440 011010
> 14: [34720..65535]: hole 30816
>
> The first thing I note is the initial allocations are just short of
> 2MB and so the extent size hint is, indeed, being truncated here
> according to contiguous free space limitations. I had thought that
> should occur from reading the code, but it's complex and I wasn't
> 100% certain what minimum allocation length would be used.
>
> Looking at the system batchlog files, I'm guessing the filesystem
> ran out of contiguous 32MB free space extents some time around
> September 25. The *Data.db files from 24 Sep and earlier then are
> all nice 32MB extents, from 25 sep onwards they never make the full
> 32MB (30-31MB max). eg, good:
>
> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> 0: [0..65535]: 350524552..350590087 3 (2651272..2716807) 65536 001111
> 1: [65536..131071]: 353378024..353443559 3 (5504744..5570279) 65536 001111
> 2: [131072..196607]: 355147016..355212551 3 (7273736..7339271) 65536 001111
> 3: [196608..262143]: 360029416..360094951 3 (12156136..12221671) 65536 001111
> 4: [262144..327679]: 362244144..362309679 3 (14370864..14436399) 65536 001111
> 5: [327680..343415]: 365809456..365825191 3 (17936176..17951911) 15736 001111
>
> bad:
>
> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> 0: [0..64127]: 512855496..512919623 4 (49024456..49088583) 64128 001111
> 1: [64128..128247]: 266567048..266631167 2 (34651528..34715647) 64120 001010
> 2: [128248..142327]: 264401888..264415967 2 (32486368..32500447) 14080 001111
>
So extent size is a hint but the extent alignment is a hard requirement.
Since eventually the ENOSPC happened due to the alignment restriction, I
think the alignment requirement should be made a hint too.
> Hmmm - there's 2 million files in this filesystem. That is quite a
> lot...
>
> Ok... I see where all the files are - there's a db that was
> snapshotted every half hour going back to December 19 2017. There's
> 55GB of snapshot data there - 14362 snapshots holding 1.8 million
> files.
>
> Ok, now I understand how the filesystem got into this mess. It has
> nothing really to do with the filesystem allocator, geometry, extent
> size hints, etc. It isn't really even an XFS specific problem - I
> think most filesystems would be in trouble if you did this to them.
Well, if you create snapshots and never delete them you'd run into a
real ENOSPC sooner or later, so the main problem was lack of snapshot
hygiene. But it did trigger a premature ENOSPC due to the alignment
restriction on those small files with hints (which I'm going to remove).
>
> First, let me demonstrate that the freespace fragmentation is caused
> by these snapshots by removing them all:
>
> before:
> from to extents blocks pct
> 1 1 5916 5916 0.00
> 2 3 10235 22678 0.01
> 4 7 12251 66829 0.02
> 8 15 5521 59556 0.01
> 16 31 5703 132031 0.03
> 32 63 9754 463825 0.11
> 64 127 16742 1590339 0.37
> 128 255 1550511 390108625 89.87
> 256 511 71516 29178504 6.72
> 512 1023 19 15355 0.00
> 1024 2047 287 461824 0.11
> 2048 4095 528 1611413 0.37
> 4096 8191 1537 10352304 2.38
> 8192 16383 2 19015 0.00
>
> Run a delete:
>
> for d in snapshots/*; do
> rm -rf $d &
> done
>
> <cranking along at ~12,000 write iops>
>
> # uptime
> 17:41:08 up 22:07, 1 user, load average: 14293.17, 13840.37, 9517.14
> #
>
> 500,000 files removed:
> from to extents blocks pct
> 64 127 22564 2054234 0.47
> 128 255 900480 226428059 51.43
> 256 511 189904 91033237 20.68
> 512 1023 68304 54958788 12.48
> 1024 2047 25187 38284024 8.70
> 2048 4095 5508 15204528 3.45
> 4096 8191 1665 10999789 2.50
> 8192 16383 15 139424 0.03
>
> 1m files removed:
> from to extents blocks pct
> 64 127 21940 1991685 0.45
> 128 255 536985 134731402 30.35
> 256 511 152092 73465972 16.55
> 512 1023 100471 82971130 18.69
> 1024 2047 48519 74016490 16.67
> 2048 4095 17272 49209538 11.09
> 4096 8191 4307 25135374 5.66
> 8192 16383 135 1254037 0.28
>
> 1.5m files removed:
> from to extents blocks pct
> 64 127 9851 924782 0.20
> 128 255 227945 57079302 12.32
> 256 511 38723 18129086 3.91
> 512 1023 33547 28027554 6.05
> 1024 2047 31904 50171699 10.83
> 2048 4095 25263 75381887 16.27
> 4096 8191 16885 102836365 22.19
> 8192 16383 6367 68809645 14.85
> 16384 32767 1862 40183775 8.67
> 32768 65535 385 16228869 3.50
> 65536 131071 51 4213237 0.91
> 131072 262143 6 958528 0.21
>
> after:
> from to extents blocks pct
> 128 255 154063 38785829 8.64
> 256 511 11037 4942114 1.10
> 512 1023 8576 6930035 1.54
> 1024 2047 8496 13464298 3.00
> 2048 4095 7664 23034455 5.13
> 4096 8191 8497 55217061 12.31
> 8192 16383 4233 45867691 10.22
> 16384 32767 1533 33488995 7.46
> 32768 65535 520 23924895 5.33
> 65536 131071 305 28675646 6.39
> 131072 262143 230 42411732 9.45
> 262144 524287 98 37213190 8.29
> 524288 1048575 41 29163579 6.50
> 1048576 2097151 27 40502889 9.03
> 2097152 4194303 5 14576157 3.25
> 4194304 8388607 2 10005670 2.23
>
> Ok, so the results is not perfect, but there are now huge contiguous
> free space extents available again - ~70% of the free space is now
> contiguous extents >=32MB in length. There's every chance that the
> fs would continue to help reform large contiguous free spaces as the
> database files come and go now, as long as the snapshot problem is
> dealt with.
>
> So, what's the problem? Well, it's simply that the workload is
> mixing data with vastly different temporal characteristics in the
> same physical locality. Every half an hour, a set of ~100 smallish
> files are written into a new directory which lands them at the low
> end of the largest free space extent in that AG. Each new snapshot
> directory ends up in a different AG, so it slowly spreads the
> snapshots across all the AGs in the filesystem.
Not exactly - those snapshots are hard links into the live database
files, which eventually get removed. Usually, small files get removed
early, but with the snapshots they get to live forever.
> Each snapshot effectively appends to the current working area in the
> AG, chopping it out of the largest contiguous free space. By the
> time the next snapshot in that AG comes around, there's other new
> short term data between the old snapshot and the new one. The new
> snapshot chops up the largest freespace, and on goes the cycle.
>
> Eventually the short term data between the snapshots gets removed,
> but this doesn't reform large contiguous free spaces because the
> snapshot data is in the way. And so this cycle continues with the
> snapshot data chopping up the largest freespace extents in the
> filesystem until there are no more large free space extents to be
> found.
>
> The solution is to manage the snapshot data better. We need to keep
> all the long term data physically isolated from the short term data
> so they don't fragment free space. A short term application level
> solution would require migrating the snapshot data out of the
> filesystem to somewhere else and pointing to it with symlinks.
Snapshots should not live forever on the disk. The procedure is to
create a snapshot, copy it away, and then delete the snapshot. It's okay
to let snapshots live for a while, but not all of them and not without a
bound on their lifetime.
The filesystem did have a role in this, by requiring alignment of the
extent to the RAID stripe size. Now, given that this was a RAID with one
member, alignment is pointless, but most of our deployments are to RAID
arrays with more than one member, and alignment does save 12.5% of IOPS
compared
to un-aligned extents for compactions and writes (our scans/writes use
128k buffers, and the alignment is to 1MB). The database caused the
problem by indirectly requiring 1MB alignment for files that are much
smaller than 1MB, and the user contributed to the problem by causing
millions of such small files to be kept.
>
> From the filesystem POV, I'm not sure that there is much we can do
> about this directly - we have no idea what the lifetime of the data
> is going to be....
>
> <ding>
>
> Hold on....
>
> <rummage in code>
>
> ....we already have an interface for setting those sorts of hints.
>
> fcntl(F_SET_RW_HINT, rw_hint)
>
> /*
> * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
> * used to clear any hints previously set.
> */
> #define RWF_WRITE_LIFE_NOT_SET 0
> #define RWH_WRITE_LIFE_NONE 1
> #define RWH_WRITE_LIFE_SHORT 2
> #define RWH_WRITE_LIFE_MEDIUM 3
> #define RWH_WRITE_LIFE_LONG 4
> #define RWH_WRITE_LIFE_EXTREME 5
>
> Avi, does this sound like something that you could use to
> classify the different types of data the data base writes out?
So long as the penalty for a mis-classification is not too large, we can
for sure. Commitlog files have short lifespan, and so do newly born
small data files. Those small data files are compacted into increasingly
larger and long-lived files, and this information is known at the time
of creation.
Even without the filesystem altering its allocation according to the
hint, this is still useful, since the disk will alter its internal
allocation and maybe do something useful with it (as long as the
filesystem passes the hint to the disk).
>
> I'll need to have a think about how to apply this to the allocator
> policy algorithms before going any further, but I suspect making use
> of this hint interface will allow us to prevent interleaving of short
> and long term data and so avoid the freespace fragmentation it is
> causing here....
IIUC, the problem (of having ENOSPC on a 10% used disk) is not
fragmentation per se, it's the alignment requirement. To take it to
extreme, a 1TB disk can only hold a million files if those files must be
aligned to 1MB, even if everything is perfectly laid out. For sure
fragmentation would have degraded performance sooner or later, but
that's not as bad as that ENOSPC.
I'm addressing the ENOSPC by removing the extent allocation hint on
files that are known small (and increasing their application buffer
sizes). In fact that will increase fragmentation, as the filesystem will
allocate one extent per buffer rather than one extent for the entire
file. But given that the extent size is treated as a hint (or so I infer
from the fact that we have <32MB extents), I think the alignment should
be a hint too. Perhaps allocation with a hint should be performed in two
passes, first trying to match size and alignment, and second relaxing
both restrictions.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk
2018-10-19 1:24 ` Dave Chinner
@ 2018-10-21 9:00 ` Avi Kivity
2018-10-21 14:34 ` Dave Chinner
0 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2018-10-21 9:00 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On 19/10/2018 04.24, Dave Chinner wrote:
> On Thu, Oct 18, 2018 at 06:44:54PM +0300, Avi Kivity wrote:
>> On 18/10/2018 14.00, Avi Kivity wrote:
>>>
>>> This can happen, and indeed I see our default hint is 1MB, so our
>>> small files use a 1MB hint. Looks like we should remove that 1MB
>>> hint since it's reducing allocation flexibility for XFS without a
>>> good return.
>>
>> I convinced myself that this is the root cause, it fits perfectly
>> with your explanation. I still think that XFS should allocate
>> *something* rather than ENOSPC, but I can also understand someone
>> wanting a guarantee.
> Yup, it's a classic catch 22.
>
>>> On the other hand, I worry that because we bypass the page cache,
>>> XFS doesn't get to see the entire file at one time and so it will
>>> get fragmented.
>>
>> That's what happens. I write 1000 4k writes to 400 files, in
>> parallel, AIO+DIO. I got 400 perfectly-fragmented files, each had
>> 1000 extents.
> Yup, you wrote them all in the one directory, didn't you? :)
Yes :(
But if I have more concurrently-written files than AGs, I'd get the same
behavior with multiple directories, no?
>> So I'll remove the default hint for small files, and replace it with
>> larger buffer sizes so we batch more and don't get 8k-sized extents
>> (which is our default buffer size).
> Or you could just mount with the "noalign" mount option to turn off
> stripe alignment. After all, you don't need stripe alignment for a
> single spindle....
For a single spindle, sure. But most deployments have multiple spindles.
Since these aren't real spindles, the advantages of alignment are not as
great, but they still exist. The files are written with aligned offsets,
and some of the reads are also aligned, so it saves IOPS whenever we
cross an alignment boundary.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk
2018-10-19 1:15 ` Dave Chinner
@ 2018-10-21 9:21 ` Avi Kivity
2018-10-21 15:06 ` Dave Chinner
0 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2018-10-21 9:21 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On 19/10/2018 04.15, Dave Chinner wrote:
> On Thu, Oct 18, 2018 at 02:00:19PM +0300, Avi Kivity wrote:
>> On 18/10/2018 13.05, Dave Chinner wrote:
>>> On Thu, Oct 18, 2018 at 10:55:18AM +0300, Avi Kivity wrote:
>>>> On 18/10/2018 04.37, Dave Chinner wrote:
>>>>> On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
>>>>>> I have a user running a 1.7TB filesystem with ~10% usage (as shown
>>>>>> by df), getting sporadic ENOSPC errors. The disk is mounted with
>>>>>> inode64 and has a relatively small number of large files. The disk
>>>>>> is a single-member RAID0 array, with 1MB chunk size. There are 32
>>> Ok, now I need to know what "single member RAID0 array" means,
>>> becuase this is clearly related to allocation alignment and I need
>>> to know why the FS was configured the way it was.
>>
>> It's a Linux RAID device, /dev/md0.
>>
>>
>> We configure it this way so that it's easy to add storage (okay, the
>> real reason is probably to avoid special casing one drive).
> As a stripe? That requires resilvering to expand, which is a slow,
> messy operation. There's also been too many horror stories about
> crashes during rsilvering causing unrecoverable corruptions for my
> liking...
Like I said, the real reason is to avoid a special case for one disk. I
don't think we, or any of our users, ever expanded a RAID array in this way.
>
>> One disk, organized into a Linux RAID device with just one member.
> So there's no real need for IO alignment at all. Unaligned writes
> to RAID0 don't require RMW cycles, so alignment is really only used
> to avoid hotspotting a disk in the stripe. Which isn't an issue
> here, either.
It does help (for >1 member arrays) in avoiding a logically aligned read
or write to be split into two ops targeting two disks.
>>>> meta-data=/dev/loop2 isize=512 agcount=32, agsize=14494720 blks
>>>> = sectsz=512 attr=2, projid32bit=1
>>>> = crc=1 finobt=0 spinodes=0 rmapbt=0
>>>> = reflink=0
>>>> data = bsize=4096 blocks=463831040, imaxpct=5
>>>> = sunit=256 swidth=256 blks
>>> sunit=swidth is unusual for a RAID0 array, unless it's hardware RAID
>>> and the array only reports one number to mkfs. What this chosen by
>>> mkfs, or specifically configured by the user? If specifically
>>> configured, why?
>>
>> I'm guessing it's because it has one member? I'm guessing the usual
>> is swidth=sunit*nmembers?
> *nod*. Which is unusual for a RAID0 device.
>
>>> What is important is that it means aligned allocations will be used
>>> for any allocation that is over sunit (1MB) and that's where all the
>>> problems seem to come from.
>> Do these aligned allocations not fall back to non-aligned
>> allocations if they fail?
> They do, but extent size hints change the fallback behaviour...
>
>>> See how we lost a large aligned 2MB freespace @ 9 when the small
>>> file "nn" was laid down? repeat this fill and free pattern over and
>>> over again, and eventually it fragments the free space until there's
>>> no large contiguous free spaces left, and large aligned extents can
>>> no longer be allocated.
>>>
>>> For this to trigger you need the small files to be larger than 1
>>> stripe unit, but still much smaller than the extent size hint, and
>>> the small files need to hang around as the large files come and go.
>>
>> This can happen, and indeed I see our default hint is 1MB, so our
>> small files use a 1MB hint.
> Ok, which forces all allocations to be at least stripe unit (1MB)
> aligned.
If the hint were smaller than the stripe unit, would it remove the
alignment requirement? I see you answered below.
>> Looks like we should remove that 1MB
>> hint since it's reducing allocation flexibility for XFS without a
>> good return. On the other hand, I worry that because we bypass the
>> page cache, XFS doesn't get to see the entire file at one time and
>> so it will get fragmented.
> Yes. Your other option is to use an extent size hint that is smaller
> than the sunit. That should not align to 1MB because the initial
> data allocation size is not large enough to trigger stripe
> alignment.
Wow, so we had so many factors leading to this:
- 1-disk installations arranged as RAID0 even though not strictly needed
- having a default extent allocation hint, even for small files
- having that default hint be >= the stripe unit size
- the user not removing snapshots
- XFS not falling back to unaligned allocations
>> Suppose I write a 4k file with a 1MB hint. How is that trailing
>> (1MB-4k) marked? Free extent, free extent with extra annotation, or
>> allocated extent? We may need to deallocate those extents? (will
>> FALLOC_FL_PUNCH_HOLE do the trick?)
> It's an unwritten extent beyond EOF, and how that is treated when
> the file is last closed depends on how that extent was allocated.
> But, yes, punching the range beyond EOF will definitely free it.
I think we can conclude from the dump that the filesystem freed it?
>>>>>> Is this a known issue?
>>> The effect and symptom are known - it's a generic large aligned extent vs small unaligned extent
>>> issue, but I've never seen it manifest in a user workload outside of
>>> a very constrained multistream realtime video ingest/playout
>>> workload (i.e. the workload the filestreams allocator was written
>>> for). And before you ask, no, the filestreams allocator does not
>>> solve this problem.
>>>
>>> The most common manifestation of this problem has been inode
>>> allocation on filesystems full of small files - inodes are allocated
>>> in large aligned extents compared to small files, and so eventually
>>> the filesystem runs out of large contigouous freespace and inodes
>>> can't be allocated. The sparse inodes mkfs option fixed this by
>>> allowing inodes to be allocated as sparse chunks so they could
>>> interleave into any free space available....
>> Shouldn't XFS fall back to a non-aligned allocation rather that
>> returning ENOSPC on a filesystem with 90% free space?
> The filesystem does fall back to unaligned allocation - there are ~5
> separate, progressively less strict allocation attempts on failure.
>
> The problem is that the extent size hint is asking to allocate a
> contiguous 32MB extent and there's no contiguous 32MB free space
> extent available, aligned or not. That's what I think is generating
> the ENOSPC error, but it's not clear to me from the code whether it
> is supposed to ignore the extent size hint on failure and allocate a
> set of shorter unaligned extents or not....
Here's a file from the dump:
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 1eb2: 3928e00.. 392acb2: 1eb3:
1: 1eb3.. 3cb2: 3c91200.. 3c92fff: 1e00: 392acb3:
2: 3cb3.. 57b2: 3454100.. 3455bff: 1b00: 3c93000:
3: 57b3.. 6fb2: 34ecd00.. 34ee4ff: 1800: 3455c00:
4: 6fb3.. 85fe: 3386a00.. 338804b: 164c: 34ee500:
5: 85ff.. 9c0b: 2c85c00.. 2c8720c: 160d: 338804c:
6: 9c0c.. b217: 3099900.. 309af0b: 160c: 2c8720d:
7: b218.. c823: 34fb300.. 34fc90b: 160c: 309af0c:
8: c824.. de2b: 315ef00.. 3160507: 1608: 34fc90c:
9: de2c.. f42f: 36adc00.. 36af203: 1604: 3160508:
10: f430.. 10a30: 2cf4400.. 2cf5a00: 1601: 36af204:
11: 10a31.. 12030: 2e03300.. 2e048ff: 1600: 2cf5a01:
12: 12031.. 13630: 2ff5200.. 2ff67ff: 1600: 2e04900:
13: 13631.. 14c30: 3199e00.. 319b3ff: 1600: 2ff6800:
14: 14c31.. 16230: 32ed500.. 32eeaff: 1600: 319b400:
15: 16231.. 17830: 34a0b00.. 34a20ff: 1600: 32eeb00:
16: 17831.. 18e30: 354e700.. 354fcff: 1600: 34a2100:
17: 18e31.. 1a430: 362c400.. 362d9ff: 1600: 354fd00:
18: 1a431.. 1ba1d: 3192b00.. 31940ec: 15ed: 362da00:
19: 1ba1e.. 1d05c: 4228500.. 4229b3e: 163f: 31940ed:
20: 1d05d.. 1e692: 3f6c900.. 3f6df35: 1636: 4229b3f:
21: 1e693.. 1fcc0: 37d4400.. 37d5a2d: 162e: 3f6df36:
22: 1fcc1.. 212e4: 43f9c00.. 43fb223: 1624: 37d5a2e:
23: 212e5.. 22905: 4003500.. 4004b20: 1621: 43fb224:
24: 22906.. 23803: 1fdb900.. 1fdc7fd: efe: 4004b21: last,eof
So, lengths are not always aligned, but physical_offset always is. So
XFS relaxes the extent size hint but not alignment.
It looks like XFS allocates one extent and moves on, not trying to
allocate all the way to the 32MB hint size. If that were the case, we'd
see logical_offset restore alignment every 32MB.
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk
2018-10-18 15:54 ` Eric Sandeen
@ 2018-10-21 11:49 ` Avi Kivity
0 siblings, 0 replies; 26+ messages in thread
From: Avi Kivity @ 2018-10-21 11:49 UTC (permalink / raw)
To: Eric Sandeen, linux-xfs
On 18/10/2018 18.54, Eric Sandeen wrote:
> On 10/17/18 2:52 AM, Avi Kivity wrote:
>> I have a user running a 1.7TB filesystem with ~10% usage (as shown by df), getting sporadic ENOSPC errors. The disk is mounted with inode64 and has a relatively small number of large files. The disk is a single-member RAID0 array, with 1MB chunk size. There are 32 AGs. Running Linux 4.9.17.
>>
>>
>> The write load consists of AIO/DIO writes, followed by unlinks of these files. The writes are non-size-changing (we truncate ahead) and we use XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of 32MB. The errors happen on commit logs, which have a target size of 32MB (but may exceed it a little).
>>
>>
>> The errors are sporadic and after restarting the workload they go away for a few hours to a few days, but then return. During one of the crashes I used xfs_db to look at fragmentation and saw that most AGs had free extents of size categories up to 128-255, but a few had more. I tried xfs_fsr but it did not help.
>>
>>
>> Is this a known issue? Would upgrading the kernel help?
>>
>>
>> I'll try to get a metadata dump next time this happens, and I'll be happy to supply more information.
> It sounds like you all figured this out, but I'll drop a reference to
> One Weird Trick to figure out just what function is returning a specific
> error value (the example below is EINVAL)
>
> First is my hack, what follows was Dave's refinement. We should get this
> into scripts/ some day.
Cool, although to get noticed these days you have to put in bpf
somewhere (and probably it can help with some kernel-side filtering -
start logging as soon as you see the error, and hopefully you can
recover the path from the returns).
>> # for FUNCTION in `grep "t xfs_" /proc/kallsyms | awk '{print $3}'`; do echo "r:ret_$FUNCTION $FUNCTION \$retval" >> /sys/kernel/debug/tracing/kprobe_events; done
>>
>> # for ENABLE in /sys/kernel/debug/tracing/events/kprobes/ret_xfs_*/enable; do echo 1 > $ENABLE; done
>>
>> run a test that fails:
>>
>> # dd if=/dev/zero of=newfile bs=513 oflag=direct
>> dd: writing `newfile': Invalid argument
>>
>> # for ENABLE in /sys/kernel/debug/tracing/events/kprobes/ret_xfs_*/enable; do echo 0 > $ENABLE; done
>>
>> # cat /sys/kernel/debug/tracing/trace
>> <snip>
>> <...>-63791 [000] d... 705435.568913: ret_xfs_vn_mknod: (xfs_vn_create+0x13/0x20 [xfs] <- xfs_vn_mknod) arg1=0
>> <...>-63791 [000] d... 705435.568913: ret_xfs_vn_create: (vfs_create+0xdb/0x100 <- xfs_vn_create) arg1=0
>> <...>-63791 [000] d... 705435.568918: ret_xfs_file_open: (do_dentry_open+0x24e/0x2e0 <- xfs_file_open) arg1=0
>> <...>-63791 [000] d... 705435.568934: ret_xfs_file_dio_aio_write: (xfs_file_aio_write+0x147/0x150 [xfs] <- xfs_file_dio_aio_write) arg1=ffffffffffffffea
>>
>> Hey look, it's "-22" in hex!
>>
>> so it's possible, but bleah.
> Dave later refined that to:
>
>> #!/bin/bash
>>
>> TRACEDIR=/sys/kernel/debug/tracing
>>
>> grep -i 't xfs_' /proc/kallsyms | awk '{print $3}' | while read F; do
>> echo "r:ret_$F $F \$retval" >> $TRACEDIR/kprobe_events
>> done
>>
>> for E in $TRACEDIR/events/kprobes/ret_xfs_*/enable; do
>> echo 1 > $E
>> done;
>>
>> echo 'arg1 > 0xffffffffffffff00' > $TRACEDIR/events/kprobes/filter
>>
>> for T in $TRACEDIR/events/kprobes/ret_xfs_*/trigger; do
>> echo 'traceoff if arg1 > 0xffffffffffffff00' > $T
>> done
>
>
>> And that gives:
>>
>> # dd if=/dev/zero of=/mnt/scratch/newfile bs=513 oflag=direct
>> dd: error writing `/mnt/scratch/newfile': Invalid argument
>> 1+0 records in
>> 0+0 records out
>> 0 bytes (0 B) copied, 0.000259882 s, 0.0 kB/s
>> root@test4:~# cat /sys/kernel/debug/tracing/trace
>> # tracer: nop
>> #
>> # entries-in-buffer/entries-written: 1/1 #P:16
>> #
>> # _-----=> irqs-off
>> # / _----=> need-resched
>> # | / _---=> hardirq/softirq
>> # || / _--=> preempt-depth
>> # ||| / delay
>> # TASK-PID CPU# |||| TIMESTAMP FUNCTION
>> # | | | |||| | |
>> <...>-8073 [006] d... 145740.460546: ret_xfs_file_dio_aio_write: (xfs_file_aio_write+0x170/0x180 <- xfs_file_dio_aio_write) arg1=0xffffffffffffffea
>>
>> Which is precisely the detection that XFS_ERROR would have given us.
>> Ok, so I guess we can now add whatever need need to that trigger...
>>
>> Basically, pass in the XFS function names you want to trace, that
>> sets up the events with whatever trigger behaviour you want, and
>> we're off to the races...
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk
2018-10-21 8:55 ` Avi Kivity
@ 2018-10-21 14:28 ` Dave Chinner
2018-10-22 8:35 ` Avi Kivity
0 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2018-10-21 14:28 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs
On Sun, Oct 21, 2018 at 11:55:47AM +0300, Avi Kivity wrote:
>
> On 19/10/2018 10.51, Dave Chinner wrote:
> >On Thu, Oct 18, 2018 at 04:36:42PM +0300, Avi Kivity wrote:
> >>On 18/10/2018 14.00, Avi Kivity wrote:
> >>>>Can I get access to the metadump to dig around in the filesystem
> >>>>directly so I can see how everything has ended up laid out? that
> >>>>will help me work out what is actually occurring and determine if
> >>>>mkfs/mount options can address the problem or whether deeper
> >>>>allocator algorithm changes may be necessary....
> >>>I will ask permission to share the dump.
> >>I'll send you a link privately.
> >Thanks - I've started looking at this - the information here is
> >just layout stuff - I've omitted filenames and anything else that
> >might be identifying from the output.
> >
> >Looking at a commit log file:
> >
> >stat.size = 33554432
> >stat.blocks = 34720
> >fsxattr.xflags = 0x800 [----------e-----]
> >fsxattr.projid = 0
> >fsxattr.extsize = 33554432
> >fsxattr.cowextsize = 0
> >fsxattr.nextents = 14
> >
> >
> >and the layout:
> >
> >EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> > 0: [0..4079]: 2646677520..2646681599 22 (95606800..95610879) 4080 001010
> > 1: [4080..8159]: 2643130384..2643134463 22 (92059664..92063743) 4080 001010
> > 2: [8160..12239]: 2642124816..2642128895 22 (91054096..91058175) 4080 001010
> > 3: [12240..16319]: 2640666640..2640670719 22 (89595920..89599999) 4080 001010
> > 4: [16320..18367]: 2640523264..2640525311 22 (89452544..89454591) 2048 000000
> > 5: [18368..20415]: 2640119808..2640121855 22 (89049088..89051135) 2048 000000
> > 6: [20416..21287]: 2639874064..2639874935 22 (88803344..88804215) 872 001111
> > 7: [21288..21295]: 2639874936..2639874943 22 (88804216..88804223) 8 011111
> > 8: [21296..24495]: 2639874944..2639878143 22 (88804224..88807423) 3200 001010
> > 9: [24496..26543]: 2639427584..2639429631 22 (88356864..88358911) 2048 000000
> > 10: [26544..28591]: 2638981120..2638983167 22 (87910400..87912447) 2048 000000
> > 11: [28592..30639]: 2638770176..2638772223 22 (87699456..87701503) 2048 000000
> > 12: [30640..31279]: 2638247952..2638248591 22 (87177232..87177871) 640 001111
> > 13: [31280..34719]: 2638248592..2638252031 22 (87177872..87181311) 3440 011010
> > 14: [34720..65535]: hole 30816
> >
> >The first thing I note is the initial allocations are just short of
> >2MB and so the extent size hint is, indeed, being truncated here
> >according to contiguous free space limitations. I had thought that
> >should occur from reading the code, but it's complex and I wasn't
> >100% certain what minimum allocation length would be used.
> >
> >Looking at the system batchlog files, I'm guessing the filesystem
> >ran out of contiguous 32MB free space extents some time around
> >September 25. The *Data.db files from 24 Sep and earlier then are
> >all nice 32MB extents, from 25 sep onwards they never make the full
> >32MB (30-31MB max). eg, good:
> >
> > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> > 0: [0..65535]: 350524552..350590087 3 (2651272..2716807) 65536 001111
> > 1: [65536..131071]: 353378024..353443559 3 (5504744..5570279) 65536 001111
> > 2: [131072..196607]: 355147016..355212551 3 (7273736..7339271) 65536 001111
> > 3: [196608..262143]: 360029416..360094951 3 (12156136..12221671) 65536 001111
> > 4: [262144..327679]: 362244144..362309679 3 (14370864..14436399) 65536 001111
> > 5: [327680..343415]: 365809456..365825191 3 (17936176..17951911) 15736 001111
> >
> >bad:
> >
> >EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
> > 0: [0..64127]: 512855496..512919623 4 (49024456..49088583) 64128 001111
> > 1: [64128..128247]: 266567048..266631167 2 (34651528..34715647) 64120 001010
> > 2: [128248..142327]: 264401888..264415967 2 (32486368..32500447) 14080 001111
>
>
> So extent size is a hint but the extent alignment is a hard
> requirement.
No, physical alignment is being ignored here, too. Those flags on
the end?
FLAG Values:
0100000 Shared extent
0010000 Unwritten preallocated extent
0001000 Doesn't begin on stripe unit
0000100 Doesn't end on stripe unit
0000010 Doesn't begin on stripe width
0000001 Doesn't end on stripe width
When you have 001111, the allocation was completely unaligned.
When you have 001010, the tail is stripe aligned
When you have 000000, the head and tail are stripe aligned
As you can see, there is a mix of aligned, tail aligned and
completely unaligned extents.
So, no, XFS is dropping both size hints and alignment hints when
it starts running out of aligned contiguous free space extents.
> >Ok, so the result is not perfect, but there are now huge contiguous
> >free space extents available again - ~70% of the free space is now
> >contiguous extents >=32MB in length. There's every chance that the
> >fs would continue to help reform large contiguous free spaces as the
> >database files come and go now, as long as the snapshot problem is
> >dealt with.
> >
> >So, what's the problem? Well, it's simply that the workload is
> >mixing data with vastly different temporal characteristics in the
> >same physical locality. Every half an hour, a set of ~100 smallish
> >files are written into a new directory which lands them at the low
> >end of the largest free space extent in that AG. Each new snapshot
> >directory ends up in a different AG, so it slowly spreads the
> >snapshots across all the AGs in the filesystem.
>
>
> Not exactly - those snapshots are hard links into the live database
> files, which eventually get removed. Usually, small files get
> removed early, but with the snapshots they get to live forever.
They might be created as hard links, but the effect when the
original database file links are removed is the same - the snapshotted
data lives forever, interleaved amongst short term data.
> >Each snapshot effectively appends to the current working area in the
> >AG, chopping it out of the largest contiguous free space. By the
> >time the next snapshot in that AG comes around, there's other new
> >short term data between the old snapshot and the new one. The new
> >snapshot chops up the largest freespace, and on goes the cycle.
> >
> >Eventually the short term data between the snapshots gets removed,
> >but this doesn't reform large contiguous free spaces because the
> >snapshot data is in the way. And so this cycle continues with the
> >snapshot data chopping up the largest freespace extents in the
> >filesystem until there are no more large free space extents to be
> >found.
> >
> >The solution is to manage the snapshot data better. We need to keep
> >all the long term data physically isolated from the short term data
> >so they don't fragment free space. A short term application level
> >solution would require migrating the snapshot data out of the
> >filesystem to somewhere else and point to it with symlinks.
>
>
> Snapshots should not live forever on the disk. The procedure is to
> create a snapshot, copy it away, and then delete the snapshot. It's
> okay to let snapshots live for a while, but not all of them and not
> without a bound on their lifetime.
>
>
> The filesystem did have a role in this, by requiring alignment of
> the extent to the RAID stripe size.
No, in the end it didn't.
> Now, given that this was a RAID
> with one member, alignment is pointless, but most of our deployments
> are to RAID arrays with >1 members, and alignment does save 12.5% of
> IOPS compared to un-aligned extents for compactions and writes (our
> scans/writes use 128k buffers, and the alignment is to 1MB). The
> database caused the problem by indirectly requiring 1MB alignment
> for files that are much smaller than 1MB, and the user contributed
> to the problem by causing millions of such small files to be kept.
*nod*
> >
> ><ding>
> >
> >Hold on....
> >
> ><rummage in code>
> >
> >....we already have an interface for setting those sorts of hints.
> >
> >fcntl(F_SET_RW_HINT, rw_hint)
> >
> >/*
> > * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
> > * used to clear any hints previously set.
> > */
> >#define RWF_WRITE_LIFE_NOT_SET 0
> >#define RWH_WRITE_LIFE_NONE 1
> >#define RWH_WRITE_LIFE_SHORT 2
> >#define RWH_WRITE_LIFE_MEDIUM 3
> >#define RWH_WRITE_LIFE_LONG 4
> >#define RWH_WRITE_LIFE_EXTREME 5
> >
> >Avi, does this sound like something that you could use to
> >classify the different types of data the data base writes out?
>
>
> So long as the penalty for a mis-classification is not too large, we
> can for sure.
OK.
> >I'll need to have a think about how to apply this to the allocator
> >policy algorithms before going any further, but I suspect making use
> >of this hint interface will allow us to prevent interleaving of short
> >and long term data and so avoid the freespace fragmentation it is
> >causing here....
>
>
> IIUC, the problem (of having ENOSPC on a 10% used disk) is not
> fragmentation per se, it's the alignment requirement.
Which, as I've noted above, alignment is a hint, not a requirement.
> To take it to
> extreme, a 1TB disk can only hold a million files if those files
> must be aligned to 1MB, even if everything is perfectly laid out.
> For sure fragmentation would have degraded performance sooner or
> later, but that's not as bad as that ENOSPC.
What it comes down to is that having looked into it, I don't know
why that ENOSPC error occurred.
Alignment didn't cause it because alignment was being dropped - that
just caused free space fragmentation. Extent size hints didn't
cause it because the size hints were dropped - that just caused
freespace fragmentation. A lack of free space
didn't cause it, because there was heaps of free space in all
allocation groups.
But something tickled a corner case that triggered an allocation
failure that was interpreted as ENOSPC rather than retrying the
allocation. Until I can reproduce the ENOSPC allocation failure
(and I tried!) then it'll be a mystery as to what caused it.
> entire file. But I think that, since the extent size is treated
> as a hint (or so I infer from the fact that we have <32MB extents),
> the alignment should be too. Perhaps allocation with a hint should
> be performed in two passes, first trying to match size and alignment,
> and second relaxing both restrictions.
I think I already mentioned there were 5 separate attempts to
allocate, each failure reducing restrictions:
1. extent sized and contiguous to adjacent block in file
2. extent sized and aligned, at higher block in AG
3. extent sized, not aligned, at higher block in AG
4. >= minimum length, not aligned, anywhere in AG >= target AG
5. minimum length, not aligned, in any AG
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: ENSOPC on a 10% used disk
2018-10-21 9:00 ` Avi Kivity
@ 2018-10-21 14:34 ` Dave Chinner
0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2018-10-21 14:34 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs
On Sun, Oct 21, 2018 at 12:00:16PM +0300, Avi Kivity wrote:
>
> On 19/10/2018 04.24, Dave Chinner wrote:
> >On Thu, Oct 18, 2018 at 06:44:54PM +0300, Avi Kivity wrote:
> >>On 18/10/2018 14.00, Avi Kivity wrote:
> >>>
> >>>This can happen, and indeed I see our default hint is 1MB, so our
> >>>small files use a 1MB hint. Looks like we should remove that 1MB
> >>>hint since it's reducing allocation flexibility for XFS without a
> >>>good return.
> >>
> >>I convinced myself that this is the root cause, it fits perfectly
> >>with your explanation. I still think that XFS should allocate
> >>*something* rather than ENOSPC, but I can also understand someone
> >>wanting a guarantee.
> >Yup, it's a classic catch 22.
> >
> >>>On the other hand, I worry that because we bypass the page cache,
> >>>XFS doesn't get to see the entire file at one time and so it will
> >>>get fragmented.
> >>
> >>That's what happens. I write 1000 4k writes to 400 files, in
> >>parallel, AIO+DIO. I got 400 perfectly-fragmented files, each had
> >>1000 extents.
> >Yup, you wrote them all in the one directory, didn't you? :)
>
>
> Yes :(
>
> But if I have more concurrently-written files than AGs, I'd get the
> same behavior with multiple directories, no?
Up to a point. At which point, I'd say you're doing it wrong and
tell you to use extent size hints or buffered IO so the filesystem
can turn the small random writes into nicely formed large IOs via
delayed allocation. :)
Remember the first rule of storage: Garbage In, Garbage Out.
With direct IO, it's the responsibility of the application to give
the filesystem and storage layers well formed IOs. If the app doesn't
play nice, there's nothing the filesystem or storage layers can do
to make it better....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: ENSOPC on a 10% used disk
2018-10-21 9:21 ` Avi Kivity
@ 2018-10-21 15:06 ` Dave Chinner
0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2018-10-21 15:06 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs
On Sun, Oct 21, 2018 at 12:21:33PM +0300, Avi Kivity wrote:
>
> On 19/10/2018 04.15, Dave Chinner wrote:
> >On Thu, Oct 18, 2018 at 02:00:19PM +0300, Avi Kivity wrote:
> >>On 18/10/2018 13.05, Dave Chinner wrote:
> >>>On Thu, Oct 18, 2018 at 10:55:18AM +0300, Avi Kivity wrote:
> >>>>On 18/10/2018 04.37, Dave Chinner wrote:
> >>Looks like we should remove that 1MB
> >>hint since it's reducing allocation flexibility for XFS without a
> >>good return. On the other hand, I worry that because we bypass the
> >>page cache, XFS doesn't get to see the entire file at one time and
> >>so it will get fragmented.
> >Yes. Your other option is to use an extent size hint that is smaller
> >than the sunit. That should not align to 1MB because the initial
> >data allocation size is not large enough to trigger stripe
> >alignment.
>
>
> Wow, so we had so many factors leading to this:
>
> - 1-disk installations arranged as RAID0 even though not strictly needed
>
> - having a default extent allocation hint, even for small files
>
> - having that default hint be >= the stripe unit size
>
> - the user not removing snapshots
>
> - XFS not falling back to unaligned allocations
Everything but the last is true. XFS is definitely dropping the
alignment hint once there are no more aligned contiguous free space
extents.
> >>Suppose I write a 4k file with a 1MB hint. How is that trailing
> >>(1MB-4k) marked? Free extent, free extent with extra annotation, or
> >>allocated extent? We may need to deallocate those extents? (will
> >>FALLOC_FL_PUNCH_HOLE do the trick?)
> >It's an unwritten extent beyond EOF, and how that is treated when
> >the file is last closed depends on how that extent was allocated.
> >But, yes, punching the range beyond EOF will definitely free it.
>
> I think we can conclude from the dump that the filesystem freed it?
*nod*
> ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 1eb2: 3928e00.. 392acb2: 1eb3:
> 1: 1eb3.. 3cb2: 3c91200.. 3c92fff: 1e00: 392acb3:
> 2: 3cb3.. 57b2: 3454100.. 3455bff: 1b00: 3c93000:
> 3: 57b3.. 6fb2: 34ecd00.. 34ee4ff: 1800: 3455c00:
> 4: 6fb3.. 85fe: 3386a00.. 338804b: 164c: 34ee500:
> 5: 85ff.. 9c0b: 2c85c00.. 2c8720c: 160d: 338804c:
> 6: 9c0c.. b217: 3099900.. 309af0b: 160c: 2c8720d:
> 7: b218.. c823: 34fb300.. 34fc90b: 160c: 309af0c:
> 8: c824.. de2b: 315ef00.. 3160507: 1608: 34fc90c:
> 9: de2c.. f42f: 36adc00.. 36af203: 1604: 3160508:
> 10: f430.. 10a30: 2cf4400.. 2cf5a00: 1601: 36af204:
> 11: 10a31.. 12030: 2e03300.. 2e048ff: 1600: 2cf5a01:
> 12: 12031.. 13630: 2ff5200.. 2ff67ff: 1600: 2e04900:
> 13: 13631.. 14c30: 3199e00.. 319b3ff: 1600: 2ff6800:
> 14: 14c31.. 16230: 32ed500.. 32eeaff: 1600: 319b400:
> 15: 16231.. 17830: 34a0b00.. 34a20ff: 1600: 32eeb00:
> 16: 17831.. 18e30: 354e700.. 354fcff: 1600: 34a2100:
> 17: 18e31.. 1a430: 362c400.. 362d9ff: 1600: 354fd00:
> 18: 1a431.. 1ba1d: 3192b00.. 31940ec: 15ed: 362da00:
> 19: 1ba1e.. 1d05c: 4228500.. 4229b3e: 163f: 31940ed:
> 20: 1d05d.. 1e692: 3f6c900.. 3f6df35: 1636: 4229b3f:
> 21: 1e693.. 1fcc0: 37d4400.. 37d5a2d: 162e: 3f6df36:
> 22: 1fcc1.. 212e4: 43f9c00.. 43fb223: 1624: 37d5a2e:
> 23: 212e5.. 22905: 4003500.. 4004b20: 1621: 43fb224:
> 24: 22906.. 23803: 1fdb900.. 1fdc7fd: efe: 4004b21: last,eof
filefrag? I find that utterly unreadable, and without the command
line I don't know what the units are. Can you use 'xfs_bmap -vvp'
so that all the units are known and it automatically calculates
whether extents are aligned or not?
> So, lengths are not always aligned, but physical_offset always is.
> So XFS relaxes the extent size hint but not alignment.
No, that is incorrect.
Filesystems never do what people expect them to.
i.e. what you see above is because the filesystem could not find
large enough contiguous free spaces to align both the ends of the
allocation. i.e.
Freespace looks like:
+----FF+FFFFFF+FFFFFF+FFFF-+------+
Alloc aligned w/ min len and max len
+----FF+FFFFFF+FFFFFF+FFFF-+------+
+WANT-THIS-BIT_HERE-+
But the nearest target free space extent returns:
fffffffffffffffffffff
So we trim the front
fffffffffffffffffff
if len < min len, fail (didn't happen)
if > max len, trim end (no trim, not long enough)
And so we end up allocating front aligned and short:
+WANT-THIS-BIT_HER+
Leaving behind:
+----FF+------+------+-----+------+
That's why it looks like there are aligned extents remaining, even
when there aren't.
The allocation logic is horrifically complex - it has 20-something
controlling parameters and a heap of logic, maths and fallback paths
around them. Unless you're intimately familiar with the code,
you're unlikely to infer the allocator decisions from an extent
list....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: ENSOPC on a 10% used disk
2018-10-21 14:28 ` Dave Chinner
@ 2018-10-22 8:35 ` Avi Kivity
2018-10-22 9:52 ` Dave Chinner
0 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2018-10-22 8:35 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On 21/10/2018 17.28, Dave Chinner wrote:
> On Sun, Oct 21, 2018 at 11:55:47AM +0300, Avi Kivity wrote:
>> On 19/10/2018 10.51, Dave Chinner wrote:
>>> On Thu, Oct 18, 2018 at 04:36:42PM +0300, Avi Kivity wrote:
>>>> On 18/10/2018 14.00, Avi Kivity wrote:
>>>>>> Can I get access to the metadump to dig around in the filesystem
>>>>>> directly so I can see how everything has ended up laid out? that
>>>>>> will help me work out what is actually occurring and determine if
>>>>>> mkfs/mount options can address the problem or whether deeper
>>>>>> allocator algorithm changes may be necessary....
>>>>> I will ask permission to share the dump.
>>>> I'll send you a link privately.
>>> Thanks - I've started looking at this - the information here is
>>> just layout stuff - I'm omitted filenames and anything else that
>>> might be identifying from the output.
>>>
>>> Looking at a commit log file:
>>>
>>> stat.size = 33554432
>>> stat.blocks = 34720
>>> fsxattr.xflags = 0x800 [----------e-----]
>>> fsxattr.projid = 0
>>> fsxattr.extsize = 33554432
>>> fsxattr.cowextsize = 0
>>> fsxattr.nextents = 14
>>>
>>>
>>> and the layout:
>>>
>>> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
>>> 0: [0..4079]: 2646677520..2646681599 22 (95606800..95610879) 4080 001010
>>> 1: [4080..8159]: 2643130384..2643134463 22 (92059664..92063743) 4080 001010
>>> 2: [8160..12239]: 2642124816..2642128895 22 (91054096..91058175) 4080 001010
>>> 3: [12240..16319]: 2640666640..2640670719 22 (89595920..89599999) 4080 001010
>>> 4: [16320..18367]: 2640523264..2640525311 22 (89452544..89454591) 2048 000000
>>> 5: [18368..20415]: 2640119808..2640121855 22 (89049088..89051135) 2048 000000
>>> 6: [20416..21287]: 2639874064..2639874935 22 (88803344..88804215) 872 001111
>>> 7: [21288..21295]: 2639874936..2639874943 22 (88804216..88804223) 8 011111
>>> 8: [21296..24495]: 2639874944..2639878143 22 (88804224..88807423) 3200 001010
>>> 9: [24496..26543]: 2639427584..2639429631 22 (88356864..88358911) 2048 000000
>>> 10: [26544..28591]: 2638981120..2638983167 22 (87910400..87912447) 2048 000000
>>> 11: [28592..30639]: 2638770176..2638772223 22 (87699456..87701503) 2048 000000
>>> 12: [30640..31279]: 2638247952..2638248591 22 (87177232..87177871) 640 001111
>>> 13: [31280..34719]: 2638248592..2638252031 22 (87177872..87181311) 3440 011010
>>> 14: [34720..65535]: hole 30816
>>>
>>> The first thing I note is the initial allocations are just short of
>>> 2MB and so the extent size hint is, indeed, being truncated here
>>> according to contiguous free space limitations. I had thought that
>>> should occur from reading the code, but it's complex and I wasn't
>>> 100% certain what minimum allocation length would be used.
>>>
>>> Looking at the system batchlog files, I'm guessing the filesystem
>>> ran out of contiguous 32MB free space extents some time around
>>> September 25. The *Data.db files from 24 Sep and earlier then are
>>> all nice 32MB extents, from 25 sep onwards they never make the full
>>> 32MB (30-31MB max). eg, good:
>>>
>>> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
>>> 0: [0..65535]: 350524552..350590087 3 (2651272..2716807) 65536 001111
>>> 1: [65536..131071]: 353378024..353443559 3 (5504744..5570279) 65536 001111
>>> 2: [131072..196607]: 355147016..355212551 3 (7273736..7339271) 65536 001111
>>> 3: [196608..262143]: 360029416..360094951 3 (12156136..12221671) 65536 001111
>>> 4: [262144..327679]: 362244144..362309679 3 (14370864..14436399) 65536 001111
>>> 5: [327680..343415]: 365809456..365825191 3 (17936176..17951911) 15736 001111
>>>
>>> bad:
>>>
>>> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
>>> 0: [0..64127]: 512855496..512919623 4 (49024456..49088583) 64128 001111
>>> 1: [64128..128247]: 266567048..266631167 2 (34651528..34715647) 64120 001010
>>> 2: [128248..142327]: 264401888..264415967 2 (32486368..32500447) 14080 001111
>>
>> So extent size is a hint but the extent alignment is a hard
>> requirement.
> No, physical alignment is being ignored here, too. Those flags on
> the end?
>
> FLAG Values:
> 0100000 Shared extent
> 0010000 Unwritten preallocated extent
> 0001000 Doesn't begin on stripe unit
> 0000100 Doesn't end on stripe unit
> 0000010 Doesn't begin on stripe width
> 0000001 Doesn't end on stripe width
>
> When you have 001111, the allocation was completely unaligned.
> When you have 001010, the tail is stripe aligned
> When you have 000000, the head and tail are stripe aligned
>
> As you can see, there is a mix of aligned, tail aligned and
> completely unaligned extents.
>
> So, no, XFS is dropping both size hints and alignment hints when
> it starts running out of aligned contiguous free space extents.
You are right; I searched for and found some files that were
head-aligned, and jumped to conclusions, but there are many that are
not. Those head-aligned files probably belonged to an era in that
filesystem's life where head-aligned extents less than 1MB were available.
> >>> Ok, so the result is not perfect, but there are now huge contiguous
>>> free space extents available again - ~70% of the free space is now
>>> contiguous extents >=32MB in length. There's every chance that the
> >>> fs would continue to help reform large contiguous free spaces as the
>>> database files come and go now, as long as the snapshot problem is
>>> dealt with.
>>>
>>> So, what's the problem? Well, it's simply that the workload is
>>> mixing data with vastly different temporal characteristics in the
>>> same physical locality. Every half an hour, a set of ~100 smallish
>>> files are written into a new directory which lands them at the low
> >>> end of the largest free space extent in that AG. Each new snapshot
>>> directory ends up in a different AG, so it slowly spreads the
>>> snapshots across all the AGs in the filesystem.
>>
>> Not exactly - those snapshots are hard links into the live database
>> files, which eventually get removed. Usually, small files get
>> removed early, but with the snapshots they get to live forever.
> They might be created as hard links, but the effect when the
> original database file links are removed is the same - the snapshotted
> data lives forever, interleaved amongst short term data.
Yes.
> >>> Each snapshot effectively appends to the current working area in the
>>> AG, chopping it out of the largest contiguous free space. By the
>>> time the next snapshot in that AG comes around, there's other new
>>> short term data between the old snapshot and the new one. The new
>>> snapshot chops up the largest freespace, and on goes the cycle.
>>>
>>> Eventually the short term data between the snapshots gets removed,
>>> but this doesn't reform large contiguous free spaces because the
>>> snapshot data is in the way. And so this cycle continues with the
>>> snapshot data chopping up the largest freespace extents in the
> >>> filesystem until there are no more large free space extents to be
>>> found.
>>>
>>> The solution is to manage the snapshot data better. We need to keep
>>> all the long term data physically isolated from the short term data
>>> so they don't fragment free space. A short term application level
>>> solution would require migrating the snapshot data out of the
>>> filesystem to somewhere else and point to it with symlinks.
>>
>> Snapshots should not live forever on the disk. The procedure is to
>> create a snapshot, copy it away, and then delete the snapshot. It's
>> okay to let snapshots live for a while, but not all of them and not
>> without a bound on their lifetime.
>>
>>
>> The filesystem did have a role in this, by requiring alignment of
>> the extent to the RAID stripe size.
> No, in the end it didn't.
Right.
>
>> Now, given that this was a RAID
>> with one member, alignment is pointless, but most of our deployments
>> are to RAID arrays with >1 members, and alignment does save 12.5% of
>> IOPS compared to un-aligned extents for compactions and writes (our
>> scans/writes use 128k buffers, and the alignment is to 1MB). The
>> database caused the problem by indirectly requiring 1MB alignment
>> for files that are much smaller than 1MB, and the user contributed
>> to the problem by causing millions of such small files to be kept.
> *nod*
>
>>> <ding>
>>>
>>> Hold on....
>>>
>>> <rummage in code>
>>>
> >>> ....we already have an interface for setting those sorts of hints.
>>>
>>> fcntl(F_SET_RW_HINT, rw_hint)
>>>
>>> /*
>>> * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
>>> * used to clear any hints previously set.
>>> */
>>> #define RWF_WRITE_LIFE_NOT_SET 0
>>> #define RWH_WRITE_LIFE_NONE 1
>>> #define RWH_WRITE_LIFE_SHORT 2
>>> #define RWH_WRITE_LIFE_MEDIUM 3
>>> #define RWH_WRITE_LIFE_LONG 4
>>> #define RWH_WRITE_LIFE_EXTREME 5
>>>
>>> Avi, does this sound like something that you could use to
>>> classify the different types of data the data base writes out?
>>
>> So long as the penalty for a mis-classification is not too large, we
>> can for sure.
> OK.
>
>>> I'll need to have a think about how to apply this to the allocator
>>> policy algorithms before going any further, but I suspect making use
> >>> of this hint interface will allow us to prevent interleaving of short
> >>> and long term data and so avoid the freespace fragmentation it is
>>> causing here....
>>
>> IIUC, the problem (of having ENOSPC on a 10% used disk) is not
>> fragmentation per se, it's the alignment requirement.
> Which, as I've noted above, alignment is a hint, not a requirement.
>
>> To take it to
>> extreme, a 1TB disk can only hold a million files if those files
>> must be aligned to 1MB, even if everything is perfectly laid out.
>> For sure fragmentation would have degraded performance sooner or
>> later, but that's not as bad as that ENOSPC.
> What it comes down to is that having looked into it, I don't know
> why that ENOSPC error occurred.
>
> Alignment didn't cause it because alignment was being dropped - that
> just caused free space fragmentation. Extent size hints didn't
> cause it because the size hints were dropped - that just caused
> freespace fragmentation. A lack of free space
> didn't cause it, because there was heaps of free space in all
> allocation groups.
>
> But something tickled a corner case that triggered an allocation
> failure that was interpreted as ENOSPC rather than retrying the
> allocation. Until I can reproduce the ENOSPC allocation failure
> (and I tried!) then it'll be a mystery as to what caused it.
The user reported the error happening multiple times, taking many hours
to reproduce, but on more than one node. So it's an obscure corner case
but not obscure enough to be a one-off event.
I've asked the user to regularly trim their snapshots (they were not
aware of the snapshots actually - they were performed as a side effect
of a TRUNCATE operation), and we'll remove the default extent hint for
small files. I'll also consider noalign - the 12.5% reduction in IOPS is
perhaps not worth the fragmentation it generates.
>
>> entire file. But I think that, since the extent size is treated
>> as a hint (or so I infer from the fact that we have <32MB extents),
>> the alignment should be too. Perhaps allocation with a hint should
>> be performed in two passes, first trying to match size and alignment,
>> and second relaxing both restrictions.
> I think I already mentioned there were 5 separate attempts to
> allocate, each failure reducing restrictions:
>
> 1. extent sized and contiguous to adjacent block in file
> 2. extent sized and aligned, at higher block in AG
> 3. extent sized, not aligned, at higher block in AG
> 4. >= minimum length, not aligned, anywhere in AG >= target AG
Surprised at this one. Won't it skew usage in high AGs?
Perhaps it's rare enough not to matter.
Perhaps those higher-block/higher-AG heuristics can be improved for
non-rotational media.
> 5. minimum length, not aligned, in any AG
Thanks for your patience in helping me understand this issue.
Avi
> Cheers,
>
> Dave.
* Re: ENSOPC on a 10% used disk
2018-10-22 8:35 ` Avi Kivity
@ 2018-10-22 9:52 ` Dave Chinner
0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2018-10-22 9:52 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs
On Mon, Oct 22, 2018 at 11:35:26AM +0300, Avi Kivity wrote:
>
> On 21/10/2018 17.28, Dave Chinner wrote:
> >On Sun, Oct 21, 2018 at 11:55:47AM +0300, Avi Kivity wrote:
> >>For sure fragmentation would have degraded performance sooner or
> >>later, but that's not as bad as that ENOSPC.
> >What it comes down to is that having looked into it, I don't know
> >why that ENOSPC error occurred.
> >
> >Alignment didn't cause it because alignment was being dropped - that
> >just caused free space fragmentation. Extent size hints didn't
> >cause it because the size hints were dropped - that just caused
> >freespace fragmentation. A lack of free space
> >didn't cause it, because there was heaps of free space in all
> >allocation groups.
> >
> >But something tickled a corner case that triggered an allocation
> >failure that was interpretted as ENOSPC rather than retrying the
> >allocation. Until I can reproduce the ENOSPC allocation failure
> >(and I tried!) then it'll be a mystery as to what caused it.
>
>
> The user reported the error happening multiple times, taking many
> hours to reproduce, but on more than one node. So it's an obscure
> corner case but not obscure enough to be a one-off event.
Yeah, as with all these sorts of things, the difficulty is in
reproducing it. I'll have a look through some of the higher level
code during the week to see if there's a min/max len condition I
missed somewhere that might lead to failure instead of a retry.
It really shouldn't fail at all, because in the end a single
block allocation is allowable for normal extent size w/ alignment
allocation, and there is heaps of free space available.
> >>entire file. But I think that, since the extent size is treated
> >>as a hint (or so I infer from the fact that we have <32MB extents),
> >>the alignment should be too. Perhaps allocation with a hint should
> >>be performed in two passes, first trying to match size and alignment,
> >>and second relaxing both restrictions.
> >I think I already mentioned there were 5 separate attempts to
> >allocate, each failure reducing restrictions:
> >
> >1. extent sized and contiguous to adjacent block in file
> >2. extent sized and aligned, at higher block in AG
> >3. extent sized, not aligned, at higher block in AG
> >4. >= minimum length, not aligned, anywhere in AG >= target AG
>
>
> Surprised at this one. Won't it skew usage in high AGs?
It's a constraint based on AG locking order. We always lock in
ascending AG order, so if we've locked AG 4 and modified the free
list in preparation for allocation, then failed to find an aligned
extent, that will remain locked until we finish the allocation
process and hence we can't lock AGs <= AG 4 otherwise we risk
deadlocking the allocator.....
> Perhaps it's rare enough not to matter.
It tends to be rare because we choose the AG ahead of time to ensure
that the majority of the time there is space available.
> Thanks for your patience in helping me understand this issue.
No worries, what I'm here for :)
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: ENSOPC on a 10% used disk
2018-10-17 7:52 ENSOPC on a 10% used disk Avi Kivity
` (2 preceding siblings ...)
2018-10-18 15:54 ` Eric Sandeen
@ 2019-02-05 21:48 ` Dave Chinner
2019-02-07 10:51 ` Avi Kivity
3 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2019-02-05 21:48 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs
Hi Avi,
On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
> I have a user running a 1.7TB filesystem with ~10% usage (as shown
> by df), getting sporadic ENOSPC errors. The disk is mounted with
> inode64 and has a relatively small number of large files. The disk
> is a single-member RAID0 array, with 1MB chunk size. There are 32
> AGs. Running Linux 4.9.17.
>
>
> The write load consists of AIO/DIO writes, followed by unlinks of
> these files. The writes are non-size-changing (we truncate ahead)
> and we use XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of
> 32MB. The errors happen on commit logs, which have a target size of
> 32MB (but may exceed it a little).
>
>
> The errors are sporadic and after restarting the workload they go
> away for a few hours to a few days, but then return. During one of
> the crashes I used xfs_db to look at fragmentation and saw that most
> AGs had free extents of size categories up to 128-255, but a few had
> more. I tried xfs_fsr but it did not help.
>
>
> Is this a known issue? Would upgrading the kernel help?
Long time, I know, but Brian has just made me aware of this commit
from early 2018 that went into 4.16 that might be relevant and so I
thought it best to close the loop:
commit 6d8a45ce29c7d67cc4fc3016dc2a07660c62482a
Author: Darrick J. Wong <darrick.wong@oracle.com>
Date: Fri Jan 19 17:47:36 2018 -0800
xfs: don't screw up direct writes when freesp is fragmented
xfs_bmap_btalloc is given a range of file offset blocks that must be
allocated to some data/attr/cow fork. If the fork has an extent size
hint associated with it, the request will be enlarged on both ends to
try to satisfy the alignment hint. If free space is fragmentated,
sometimes we can allocate some blocks but not enough to fulfill any of
the requested range. Since bmapi_allocate always trims the new extent
mapping to match the originally requested range, this results in
bmapi_write returning zero and no mapping.
The consequences of this vary -- buffered writes will simply re-call
bmapi_write until it can satisfy at least one block from the original
request. Direct IO overwrites notice nmaps == 0 and return -ENOSPC
through the dio mechanism out to userspace with the weird result that
writes fail even when we have enough space because the ENOSPC return
overrides any partial write status. For direct CoW writes the situation
was disastrous because nobody notices us returning an invalid zero-length
wrong-offset mapping to iomap and the write goes off into space.
Therefore, if free space is so fragmented that we managed to allocate
some space but not enough to map into even a single block of the
original allocation request range, we should break the alignment hint in
order to guarantee at least some forward progress for the direct write.
If we return a short allocation to iomap_apply it'll call back about the
remaining blocks.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
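To make the quoted description concrete, here is a toy model of the retry logic (my simplification, not the actual patch — the real change lives in xfs_bmap_btalloc() and operates on XFS allocation arguments): if an allocation widened for the extent size hint fails to cover even one block of the originally requested range, the alignment is dropped and the allocation retried so the direct write can make forward progress.

```c
/*
 * Toy model of the retry logic the commit adds; this is NOT XFS code.
 * try_alloc() stands in for the real allocator and simulates free
 * space so fragmented that an aligned (hint-widened) request can only
 * be satisfied with blocks outside the wanted range.
 */
struct extent { long start, len; };

static struct extent try_alloc(long start, long len, int aligned)
{
	if (aligned)	/* aligned request lands below the wanted range */
		return (struct extent){ start - len, len / 2 };
	return (struct extent){ start, 1 };	/* one block is progress */
}

/*
 * Allocate blocks for [want_start, want_start + want_len): try the
 * hint-aligned request first; if nothing we got overlaps the wanted
 * range, break the alignment hint and retry.
 */
struct extent alloc_file_blocks(long want_start, long want_len)
{
	struct extent got = try_alloc(want_start, want_len, 1);

	if (got.start + got.len <= want_start ||
	    got.start >= want_start + want_len)
		got = try_alloc(want_start, want_len, 0);
	return got;
}
```

With the stand-in allocator above, alloc_file_blocks(100, 8) returns the single block at offset 100 instead of propagating a zero-length mapping — and hence a spurious ENOSPC — back to the direct IO path.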
The spurious ENOSPC symptoms seem to match what you are seeing here
on your customer's 4.9 kernel, so it may be that this is the fix for
the ENOSPC problem that was reported. If this comes up again, then
perhaps it would be worth either upgrading the kernel to 4.16+ or
backporting this commit to see if it fixes the problem.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk
2019-02-05 21:48 ` Dave Chinner
@ 2019-02-07 10:51 ` Avi Kivity
0 siblings, 0 replies; 26+ messages in thread
From: Avi Kivity @ 2019-02-07 10:51 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On 05/02/2019 23.48, Dave Chinner wrote:
> Hi Avi,
>
> On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
>> I have a user running a 1.7TB filesystem with ~10% usage (as shown
>> by df), getting sporadic ENOSPC errors. The disk is mounted with
>> inode64 and has a relatively small number of large files. The disk
>> is a single-member RAID0 array, with 1MB chunk size. There are 32
>> AGs. Running Linux 4.9.17.
>>
>>
>> The write load consists of AIO/DIO writes, followed by unlinks of
>> these files. The writes are non-size-changing (we truncate ahead)
>> and we use XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of
>> 32MB. The errors happen on commit logs, which have a target size of
>> 32MB (but may exceed it a little).
>>
>>
>> The errors are sporadic and after restarting the workload they go
>> away for a few hours to a few days, but then return. During one of
>> the crashes I used xfs_db to look at fragmentation and saw that most
>> AGs had free extents of size categories up to 128-255, but a few had
>> more. I tried xfs_fsr but it did not help.
>>
>>
>> Is this a known issue? Would upgrading the kernel help?
> Long time, I know, but Brian has just made me aware of this commit
> from early 2018 that went into 4.16 that might be relevant and so I
> thought it best to close the loop:
>
> commit 6d8a45ce29c7d67cc4fc3016dc2a07660c62482a
> Author: Darrick J. Wong <darrick.wong@oracle.com>
> Date: Fri Jan 19 17:47:36 2018 -0800
>
> xfs: don't screw up direct writes when freesp is fragmented
>
> xfs_bmap_btalloc is given a range of file offset blocks that must be
> allocated to some data/attr/cow fork. If the fork has an extent size
> hint associated with it, the request will be enlarged on both ends to
> try to satisfy the alignment hint. If free space is fragmented,
> sometimes we can allocate some blocks but not enough to fulfill any of
> the requested range. Since bmapi_allocate always trims the new extent
> mapping to match the originally requested range, this results in
> bmapi_write returning zero and no mapping.
>
> The consequences of this vary -- buffered writes will simply re-call
> bmapi_write until it can satisfy at least one block from the original
> request. Direct IO overwrites notice nmaps == 0 and return -ENOSPC
> through the dio mechanism out to userspace with the weird result that
> writes fail even when we have enough space because the ENOSPC return
> overrides any partial write status. For direct CoW writes the situation
> was disastrous because nobody notices us returning an invalid zero-length
> wrong-offset mapping to iomap and the write goes off into space.
>
> Therefore, if free space is so fragmented that we managed to allocate
> some space but not enough to map into even a single block of the
> original allocation request range, we should break the alignment hint in
> order to guarantee at least some forward progress for the direct write.
> If we return a short allocation to iomap_apply it'll call back about the
> remaining blocks.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> The spurious ENOSPC symptoms seem to match what you are seeing here
> on your customer's 4.9 kernel, so it may be that this is the fix for
> the ENOSPC problem that was reported. If this comes up again, then
> perhaps it would be worth either upgrading the kernel to 4.16+ or
> backporting this commit to see if it fixes the problem.
Thanks for remembering. Indeed it looks like a good match for the
problem. We did not see the problem again (it took quite a combination
of screwups to achieve), but I'll remember this in case we do.
^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2019-02-07 10:51 UTC | newest]
Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-17 7:52 ENSOPC on a 10% used disk Avi Kivity
2018-10-17 8:47 ` Christoph Hellwig
2018-10-17 8:57 ` Avi Kivity
2018-10-17 10:54 ` Avi Kivity
2018-10-18 1:37 ` Dave Chinner
2018-10-18 7:55 ` Avi Kivity
2018-10-18 10:05 ` Dave Chinner
2018-10-18 11:00 ` Avi Kivity
2018-10-18 13:36 ` Avi Kivity
2018-10-19 7:51 ` Dave Chinner
2018-10-21 8:55 ` Avi Kivity
2018-10-21 14:28 ` Dave Chinner
2018-10-22 8:35 ` Avi Kivity
2018-10-22 9:52 ` Dave Chinner
2018-10-18 15:44 ` Avi Kivity
2018-10-18 16:11 ` Avi Kivity
2018-10-19 1:24 ` Dave Chinner
2018-10-21 9:00 ` Avi Kivity
2018-10-21 14:34 ` Dave Chinner
2018-10-19 1:15 ` Dave Chinner
2018-10-21 9:21 ` Avi Kivity
2018-10-21 15:06 ` Dave Chinner
2018-10-18 15:54 ` Eric Sandeen
2018-10-21 11:49 ` Avi Kivity
2019-02-05 21:48 ` Dave Chinner
2019-02-07 10:51 ` Avi Kivity