* ENSOPC on a 10% used disk
From: Avi Kivity @ 2018-10-17  7:52 UTC (permalink / raw)
To: linux-xfs

I have a user running a 1.7TB filesystem with ~10% usage (as shown by
df), getting sporadic ENOSPC errors. The disk is mounted with inode64
and has a relatively small number of large files. The disk is a
single-member RAID0 array, with 1MB chunk size. There are 32 AGs.
Running Linux 4.9.17.

The write load consists of AIO/DIO writes, followed by unlinks of these
files. The writes are non-size-changing (we truncate ahead) and we use
XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of 32MB. The
errors happen on commit logs, which have a target size of 32MB (but may
exceed it a little).

The errors are sporadic and after restarting the workload they go away
for a few hours to a few days, but then return. During one of the
crashes I used xfs_db to look at fragmentation and saw that most AGs
had free extents of size categories up to 128-255, but a few had more.
I tried xfs_fsr but it did not help.

Is this a known issue? Would upgrading the kernel help? I'll try to get
a metadata dump next time this happens, and I'll be happy to supply
more information.

^ permalink raw reply  [flat|nested] 26+ messages in thread
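[For readers unfamiliar with the extent size hint mentioned above: it is set per-file through the FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR ioctls on a `struct fsxattr`. A minimal sketch in Python (the application in this thread is not Python; the helper name is an illustration, and the ioctl/flag values are the standard ones from <linux/fs.h>):]

```python
import fcntl
import struct

# struct fsxattr from <linux/fs.h>:
#   __u32 fsx_xflags; __u32 fsx_extsize; __u32 fsx_nextents;
#   __u32 fsx_projid; __u32 fsx_cowextsize; unsigned char fsx_pad[8];
FSXATTR_FMT = "=5I8s"            # 28 bytes total
FS_IOC_FSGETXATTR = 0x801C581F   # _IOR('X', 31, struct fsxattr)
FS_IOC_FSSETXATTR = 0x401C5820   # _IOW('X', 32, struct fsxattr)
FS_XFLAG_EXTSIZE = 0x00000800    # aka XFS_XFLAG_EXTSIZE

def set_extsize_hint(fd, hint_bytes):
    """Ask the filesystem to allocate extents of hint_bytes at a time."""
    buf = bytearray(struct.calcsize(FSXATTR_FMT))
    fcntl.ioctl(fd, FS_IOC_FSGETXATTR, buf)   # read current attributes
    xflags, extsize, nextents, projid, cowext, pad = \
        struct.unpack(FSXATTR_FMT, buf)
    xflags |= FS_XFLAG_EXTSIZE                # turn the hint on...
    extsize = hint_bytes                      # ...and set its size (e.g. 32MB)
    fcntl.ioctl(fd, FS_IOC_FSSETXATTR,
                struct.pack(FSXATTR_FMT, xflags, extsize, nextents,
                            projid, cowext, pad))
```

[The set must happen before data is written to the file for the hint to affect allocation.]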
* Re: ENSOPC on a 10% used disk
From: Christoph Hellwig @ 2018-10-17  8:47 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs

On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
> I have a user running a 1.7TB filesystem with ~10% usage (as shown by df),
> getting sporadic ENOSPC errors. The disk is mounted with inode64 and has a
> relatively small number of large files. The disk is a single-member RAID0
> array, with 1MB chunk size. There are 32 AGs. Running Linux 4.9.17.

4.9.17 is rather old and you'll have a hard time finding someone
familiar with it..

> Is this a known issue? Would upgrading the kernel help?

Three things that come to mind:

 - are you sure there is no open fd to the unlinked files? That would
   keep the space allocated until the last link is dropped.
 - even once we drop the inode the space only becomes available once
   the transaction has committed. We do force the log if we found a
   busy extent, but there might be some issues. Try seeing if you hit
   the xfs_extent_busy_force trace point with your workload.
 - if you have online discard (-o discard) enabled there might be more
   issues like the above, especially on old kernels.
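[For anyone wanting to check the same thing: tracepoints like xfs_extent_busy_force can be enabled through tracefs. A small Python sketch, my own helpers and not part of any tool in this thread; it assumes root and a tracefs mount at the usual path:]

```python
import os

TRACEFS = "/sys/kernel/tracing"  # older kernels: /sys/kernel/debug/tracing

def enable_event(event, tracefs=TRACEFS):
    """Turn on one tracepoint, e.g. 'xfs/xfs_extent_busy_force'."""
    path = os.path.join(tracefs, "events", event, "enable")
    with open(path, "w") as f:
        f.write("1")

def hits(tracefs=TRACEFS):
    """Count events currently in the trace buffer (rough hit count)."""
    with open(os.path.join(tracefs, "trace")) as f:
        return sum(1 for line in f if not line.startswith("#"))
```

[After enabling, run the workload for a while and check whether hits() stays at zero, as Avi does below.]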
* Re: ENSOPC on a 10% used disk
From: Avi Kivity @ 2018-10-17  8:57 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-xfs

On 17/10/2018 11.47, Christoph Hellwig wrote:
> On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
>> I have a user running a 1.7TB filesystem with ~10% usage (as shown by df),
>> getting sporadic ENOSPC errors. The disk is mounted with inode64 and has a
>> relatively small number of large files. The disk is a single-member RAID0
>> array, with 1MB chunk size. There are 32 AGs. Running Linux 4.9.17.
>
> 4.9.17 is rather old and you'll have a hard time finding someone
> familiar with it..

Yes. I expect my user will agree to upgrade, but I'd like to recommend
this only if we know there was a real issue and it was resolved, not on
general principles.

> - are you sure there is no open fd to the unlinked files? That would
>   keep the space allocated until the last link is dropped.

"df" would report that space as occupied, no? I believe a colleague
verified there were no deleted files, but I'm not 100% sure.

> - even once we drop the inode the space only becomes available once
>   the transaction has committed. We do force the log if we found
>   a busy extent, but there might be some issues. Try seeing if you
>   hit the xfs_extent_busy_force trace point with your workload.

I'll ask permission to check this and report.

> - if you have online discard (-o discard) enabled there might be
>   more issues like the above, especially on old kernels.

Online discard is not enabled:

/dev/md0 on /var/lib/scylla type xfs (rw,noatime,attr2,inode64,sunit=2048,swidth=2048,noquota)

btw, we've seen fstrim on an old disk (that was likely never trimmed)
improving its performance by a factor of ~100, so my interest in
-o discard is re-awakening. Is it good enough now to run on aio
workloads (assuming nvme) or is more work needed? My prime concern is
to avoid io_submit sleeping.
* Re: ENSOPC on a 10% used disk
From: Avi Kivity @ 2018-10-17 10:54 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-xfs

On 17/10/2018 11.57, Avi Kivity wrote:
>> - even once we drop the inode the space only becomes available once
>>   the transaction has committed. We do force the log if we found
>>   a busy extent, but there might be some issues. Try seeing if you
>>   hit the xfs_extent_busy_force trace point with your workload.
>
> I'll ask permission to check this and report.

An hour's tracing yielded zero hits. Of course, that says nothing about
other times; I'll continue to trace.
* Re: ENSOPC on a 10% used disk
From: Dave Chinner @ 2018-10-18  1:37 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs

On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
> I have a user running a 1.7TB filesystem with ~10% usage (as shown
> by df), getting sporadic ENOSPC errors. The disk is mounted with
> inode64 and has a relatively small number of large files. The disk
> is a single-member RAID0 array, with 1MB chunk size. There are 32
> AGs. Running Linux 4.9.17.

ENOSPC on what operation? write? open(O_CREAT)? something else?

What's the filesystem config (xfs_info output)?

> The write load consists of AIO/DIO writes, followed by unlinks of
> these files. The writes are non-size-changing (we truncate ahead)
> and we use XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of
> 32MB. The errors happen on commit logs, which have a target size of
> 32MB (but may exceed it a little).
>
> The errors are sporadic and after restarting the workload they go
> away for a few hours to a few days, but then return. During one of
> the crashes I used xfs_db to look at fragmentation and saw that most
> AGs had free extents of size categories up to 128-255, but a few had
> more. I tried xfs_fsr but it did not help.

32MB extents are 8192 blocks. The bucket 128-255 records extents
between 512k and 1MB in size, so it sounds like free space has been
fragmented to death. Has xfs_fsr been run on this filesystem
regularly?

If the ENOSPC errors are only from files with 32MB extent size hints
on them, then it may be that there isn't sufficient contiguous free
space to allocate an entire 32MB extent. I'm not sure what the
allocator behaviour here is (the code is a maze of twisty passages),
so I'll have to look more into this.

In the mean time, can you post the output of the freespace command
(both global and per-ag) so we can see just how much free space there
is and how badly fragmented it has become? I might be able to
reproduce the behaviour if I know the conditions under which it is
occurring.

> Is this a known issue? Would upgrading the kernel help?

Not that I know of. If it's an extszhint vs free space fragmentation
issue, then a kernel upgrade is unlikely to fix it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
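[The bucket arithmetic above is easy to verify (this filesystem uses 4 KiB blocks, per the xfs_info output later in the thread):]

```python
BLOCK = 4096                      # data block size (bsize=4096)

# A 32MB extent size hint expressed in filesystem blocks:
hint_blocks = 32 * 1024 * 1024 // BLOCK
print(hint_blocks)                # 8192 blocks, i.e. the 8192-16383 bucket

# The freesp bucket "128-255" covers extents of these byte sizes:
lo, hi = 128 * BLOCK, 255 * BLOCK
print(lo // 1024, hi // 1024)     # 512 1020 (KiB), i.e. 512k up to ~1MB
```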
* Re: ENSOPC on a 10% used disk
From: Avi Kivity @ 2018-10-18  7:55 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs

On 18/10/2018 04.37, Dave Chinner wrote:
> ENOSPC on what operation? write? open(O_CREAT)? something else?

Unknown.

> What's the filesystem config (xfs_info output)?

(restored from metadata dump)

meta-data=/dev/loop2             isize=512    agcount=32, agsize=14494720 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=463831040, imaxpct=5
         =                       sunit=256    swidth=256 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=226480, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

> 32MB extents are 8192 blocks. The bucket 128-255 records extents
> between 512k and 1MB in size, so it sounds like free space has been
> fragmented to death. Has xfs_fsr been run on this filesystem
> regularly?

xfs_fsr has never been run, until we saw the problem (and then it did
not fix it). IIUC the workload should be self-defragmenting: it
consists of writing large files, then erasing them. I estimate that
around 100 files are written concurrently (from 14 threads), and they
are written with large extent hints. With every large file, another
smaller (but still large) file is written, and a few smallish metadata
files.

I understood from xfs_fsr that it attempts to defragment files, not
free space, although that may come as a side effect. In any case I ran
xfs_db after xfs_fsr and did not see an improvement.

> If the ENOSPC errors are only from files with 32MB extent size hints
> on them, then it may be that there isn't sufficient contiguous free
> space to allocate an entire 32MB extent. I'm not sure what the
> allocator behaviour here is (the code is a maze of twisty passages),
> so I'll have to look more into this.

There are other files with 32MB hints that do not show the error (but
on the other hand, the error has been observed few enough times that
this could be a fluke).

> In the mean time, can you post the output of the freespace command
> (both global and per-ag) so we can see just how much free space
> there is and how badly fragmented it has become? I might be able to
> reproduce the behaviour if I know the conditions under which it is
> occurring.
xfs_db> freesp
   from      to  extents      blocks    pct
      1       1     5916        5916   0.00
      2       3    10235       22678   0.01
      4       7    12251       66829   0.02
      8      15     5521       59556   0.01
     16      31     5703      132031   0.03
     32      63     9754      463825   0.11
     64     127    16742     1590339   0.37
    128     255  1550511   390108625  89.87
    256     511    71516    29178504   6.72
    512    1023       19       15355   0.00
   1024    2047      287      461824   0.11
   2048    4095      528     1611413   0.37
   4096    8191     1537    10352304   2.38
   8192   16383        2       19015   0.00

Just 2 extents >= 32MB (and they may have been freed after the error).

Per-ag:

   from      to  extents      blocks    pct
      1       1      390         390   0.00
      2       3      542        1215   0.01
      4       7      590        3211   0.02
      8      15      265        2735   0.02
     16      31      219        5000   0.04
     32      63      323       15530   0.11
     64     127      620       58217   0.43
    128     255    48677    12254686  90.27
    256     511     2981     1234365   9.09

   from      to  extents      blocks    pct
      1       1      542         542   0.00
      2       3      646        1495   0.01
      4       7      592        3122   0.02
      8      15      525        5937   0.04
     16      31      539       12280   0.09
     32      63      691       33226   0.25
     64     127      851       78277   0.59
    128     255    46390    11658684  88.21
    256     511     3335     1422955  10.77

   from      to  extents      blocks    pct
      1       1      560         560   0.00
      2       3      642        1454   0.01
      4       7      483        2552   0.02
      8      15      368        4020   0.03
     16      31      440        9947   0.08
     32      63      540       25347   0.21
     64     127      733       67944   0.56
    128     255    42337    10632366  87.06
    256     511     3386     1438609  11.78
    512    1023        5        4423   0.04
   1024    2047        5        8649   0.07
   2048    4095        3        9205   0.08
   4096    8191        1        8191   0.07

   from      to  extents      blocks    pct
      1       1      662         662   0.01
      2       3      675        1545   0.02
      4       7      490        2483   0.03
      8      15      414        4485   0.05
     16      31      445        9915   0.11
     32      63      540       25279   0.29
     64     127      683       63014   0.72
    128     255    10061     2483774  28.34
    256     511     1498      574685   6.56
    512    1023        9        6715   0.08
   1024    2047        5        6967   0.08
   2048    4095      100      354101   4.04
   4096    8191      786     5229818  59.68

   from      to  extents      blocks    pct
      1       1      642         642   0.01
      2       3      705        1599   0.02
      4       7      545        2801   0.04
      8      15      407        4320   0.05
     16      31      410        9396   0.12
     32      63      513       24294   0.31
     64     127      528       48217   0.61
    128     255     2723      644939   8.17
    256     511      875      326064   4.13
    512    1023        5        4217   0.05
   1024    2047      277      446208   5.65
   2048    4095      425     1248107  15.81
   4096    8191      750     5114295  64.79
   8192   16383        2       19015   0.24

   from      to  extents      blocks    pct
      1       1      176         176   0.00
      2       3      484        1228   0.01
      4       7      825        4277   0.03
      8      15       73         870   0.01
     16      31      174        4155   0.03
     32      63      356       16746   0.12
     64     127      597       58761   0.42
    128     255    55401    13814803  99.38

   from      to  extents      blocks    pct
      1       1      182         182   0.00
      2       3      212         444   0.00
      4       7       32         188   0.00
      8      15       58         692   0.00
     16      31      102        2369   0.02
     32      63      243       11756   0.08
     64     127      449       43271   0.30
    128     255    53882    13618288  95.22
    256     511     1550      625387   4.37

   from      to  extents      blocks    pct
      1       1      147         147   0.00
      2       3      203         426   0.00
      4       7      287        1585   0.01
      8      15       84         958   0.01
     16      31      105        2370   0.02
     32      63      243       12073   0.09
     64     127      497       47704   0.34
    128     255    51847    13080484  94.15
    256     511     1897      747986   5.38

   from      to  extents      blocks    pct
      1       1       81          81   0.00
      2       3      129         262   0.00
      4       7      186        1070   0.01
      8      15      148        1781   0.01
     16      31      225        5411   0.04
     32      63      257       12226   0.09
     64     127      492       46230   0.33
    128     255    53802    13533984  95.16
    256     511     1574      621876   4.37

   from      to  extents      blocks    pct
      1       1      159         159   0.00
      2       3      191         398   0.00
      4       7      182        1009   0.01
      8      15       63         730   0.01
     16      31       88        2006   0.01
     32      63      191        9044   0.06
     64     127      494       46669   0.33
    128     255    53441    13451913  94.51
    256     511     1850      720941   5.07

   from      to  extents      blocks    pct
      1       1      156         156   0.00
      2       3      192         397   0.00
      4       7      169         948   0.01
      8      15       67         780   0.01
     16      31      115        2948   0.02
     32      63      272       12564   0.09
     64     127      511       49124   0.35
    128     255    53339    13427444  94.42
    256     511     1866      726347   5.11

   from      to  extents      blocks    pct
      1       1      157         157   0.00
      2       3      171         364   0.00
      4       7      221        1215   0.01
      8      15       45         504   0.00
     16      31      116        2628   0.02
     32      63      249       11827   0.08
     64     127      474       47158   0.33
    128     255    53261    13409025  94.35
    256     511     1886      738689   5.20

   from      to  extents      blocks    pct
      1       1      142         142   0.00
      2       3      181         395   0.00
      4       7      323        1753   0.01
      8      15      108        1176   0.01
     16      31      134        3069   0.02
     32      63      260       12055   0.08
     64     127      411       39107   0.28
    128     255    53197    13389340  94.39
    256     511     1877      737582   5.20

   from      to  extents      blocks    pct
      1       1      137         137   0.00
      2       3      174         386   0.00
      4       7      222        1232   0.01
      8      15       93        1012   0.01
     16      31       96        2192   0.02
     32      63      223       10763   0.08
     64     127      493       47665   0.34
    128     255    53125    13374075  94.17
    256     511     1949      764710   5.38

   from      to  extents      blocks    pct
      1       1       59          59   0.00
      2       3      138         309   0.00
      4       7      224        1217   0.01
      8      15      104        1211   0.01
     16      31      138        3352   0.02
     32      63      337       16480   0.12
     64     127      585       55922   0.39
    128     255    53654    13487724  95.05
    256     511     1589      623688   4.40

   from      to  extents      blocks    pct
      1       1      121         121   0.00
      2       3      264         597   0.00
      4       7      706        3907   0.03
      8      15      174        1802   0.01
     16      31       94        2243   0.02
     32      63      228       10806   0.08
     64     127      495       47228   0.34
    128     255    52078    13106646  93.94
    256     511     1953      779417   5.59

   from      to  extents      blocks    pct
      1       1      107         107   0.00
      2       3      174         370   0.00
      4       7      248        1401   0.01
      8      15      115        1318   0.01
     16      31      111        2561   0.02
     32      63      218       10243   0.07
     64     127      443       42493   0.30
    128     255    52320    13168357  94.43
    256     511     1828      717948   5.15

   from      to  extents      blocks    pct
      1       1      126         126   0.00
      2       3      353         793   0.01
      4       7      774        4297   0.03
      8      15      174        1767   0.01
     16      31      129        3135   0.02
     32      63      317       14569   0.11
     64     127      506       48326   0.35
    128     255    51507    12956078  93.58
    256     511     2055      815607   5.89

   from      to  extents      blocks    pct
      1       1      118         118   0.00
      2       3      207         448   0.00
      4       7      299        1694   0.01
      8      15       91         960   0.01
     16      31      104        2394   0.02
     32      63      358       17378   0.12
     64     127      497       47351   0.34
    128     255    52540    13229046  93.84
    256     511     1971      798192   5.66

   from      to  extents      blocks    pct
      1       1      105         105   0.00
      2       3      261         571   0.00
      4       7      333        1851   0.01
      8      15      100        1009   0.01
     16      31      137        3323   0.02
     32      63      261       12069   0.09
     64     127      482       45103   0.32
    128     255    51909    13060192  93.20
    256     511     2226      889345   6.35

   from      to  extents      blocks    pct
      1       1      111         111   0.00
      2       3      221         471   0.00
      4       7      243        1341   0.01
      8      15      101        1002   0.01
     16      31       87        2145   0.02
     32      63      265       12987   0.09
     64     127      429       41335   0.29
    128     255    51818    13031610  92.85
    256     511     2312      944418   6.73

   from      to  extents      blocks    pct
      1       1       89          89   0.00
      2       3      245         542   0.00
      4       7      383        2114   0.02
      8      15      107        1117   0.01
     16      31      153        3505   0.03
     32      63      237       11431   0.08
     64     127      489       46582   0.33
    128     255    51377    12929850  92.48
    256     511     2412      986093   7.05

   from      to  extents      blocks    pct
      1       1       83          83   0.00
      2       3      253         536   0.00
      4       7      341        1902   0.01
      8      15      118        1269   0.01
     16      31      137        3201   0.02
     32      63      235       11096   0.08
     64     127      432       41041   0.30
    128     255    51165    12882960  92.73
    256     511     2348      951207   6.85

   from      to  extents      blocks    pct
      1       1       63          63   0.00
      2       3      263         570   0.00
      4       7      427        2392   0.02
      8      15      143        1536   0.01
     16      31      117        2714   0.02
     32      63      217       10510   0.08
     64     127      402       38021   0.27
    128     255    50857    12803884  91.91
    256     511     2583     1071722   7.69

   from      to  extents      blocks    pct
      1       1       69          69   0.00
      2       3      302         645   0.00
      4       7      343        1884   0.01
      8      15      120        1234   0.01
     16      31      133        3184   0.02
     32      63      215        9971   0.07
     64     127      506       49464   0.35
    128     255    49778    12542384  89.34
    256     511     3333     1429372  10.18

   from      to  extents      blocks    pct
      1       1       62          62   0.00
      2       3      300         652   0.00
      4       7      432        2413   0.02
      8      15      173        1814   0.01
     16      31       92        2119   0.02
     32      63      253       12006   0.09
     64     127      439       43006   0.31
    128     255    49809    12539975  89.53
    256     511     3298     1403687  10.02

   from      to  extents      blocks    pct
      1       1       52          52   0.00
      2       3      283         608   0.00
      4       7      253        1382   0.01
      8      15      126        1353   0.01
     16      31      117        2653   0.02
     32      63      226       10856   0.08
     64     127      462       43181   0.31
    128     255    50799    12805008  90.86
    256     511     2899     1228715   8.72

   from      to  extents      blocks    pct
      1       1       53          53   0.00
      2       3      322         683   0.00
      4       7      473        2658   0.02
      8      15      206        2134   0.02
     16      31      149        3494   0.03
     32      63      251       12271   0.09
     64     127      548       52541   0.38
    128     255    50353    12685959  91.22
    256     511     2753     1146454   8.24

   from      to  extents      blocks    pct
      1       1       46          46   0.00
      2       3      309         655   0.00
      4       7      373        2108   0.02
      8      15      181        1951   0.01
     16      31      161        3795   0.03
     32      63      270       12433   0.09
     64     127      434       41689   0.30
    128     255    50963    12821420  91.99
    256     511     2604     1054433   7.56

   from      to  extents      blocks    pct
      1       1      121         121   0.00
      2       3      357         779   0.01
      4       7      337        1825   0.01
      8      15      220        2378   0.02
     16      31      181        4124   0.03
     32      63      297       13987   0.10
     64     127      571       53694   0.39
    128     255    49880    12560088  91.06
    256     511     2792     1155483   8.38

   from      to  extents      blocks    pct
      1       1      235         235   0.00
      2       3      439         964   0.01
      4       7      448        2445   0.02
      8      15      275        2842   0.02
     16      31      221        4979   0.04
     32      63      332       15967   0.12
     64     127      596       56251   0.41
    128     255    48484    12208089  89.11
    256     511     3341     1408614  10.28

   from      to  extents      blocks    pct
      1       1      163         163   0.00
      2       3      397         877   0.01
      4       7      467        2552   0.02
      8      15      275        2859   0.02
     16      31      234        5424   0.04
     32      63      336       16035   0.12
     64     127      593       55753   0.41
    128     255    49737    12515550  91.40
    256     511     2695     1093913   7.99

>> Is this a known issue? Would upgrading the kernel help?
>
> Not that I know of. If it's an extszhint vs free space fragmentation
> issue, then a kernel upgrade is unlikely to fix it.
>
> Cheers,
>
> Dave.
* Re: ENSOPC on a 10% used disk
From: Dave Chinner @ 2018-10-18 10:05 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs

[ hmmm, there's some whacky utf-8 whitespace characters in the
copy-n-pasted text... ]

On Thu, Oct 18, 2018 at 10:55:18AM +0300, Avi Kivity wrote:
> On 18/10/2018 04.37, Dave Chinner wrote:
>> On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
>>> I have a user running a 1.7TB filesystem with ~10% usage (as shown
>>> by df), getting sporadic ENOSPC errors. The disk is mounted with
>>> inode64 and has a relatively small number of large files. The disk
>>> is a single-member RAID0 array, with 1MB chunk size. There are 32

Ok, now I need to know what "single member RAID0 array" means, because
this is clearly related to allocation alignment and I need to know why
the FS was configured the way it was.

It's one disk? Or is it a hardware RAID0 array that presents as a
single lun with a stripe width of 1MB? If so, how many disks are in
it? Is the chunk size the stripe unit (per-disk chunk size) or the
stripe width (all disks get hit by a 1MB IO)?

Or something else?

>>> AGs. Running Linux 4.9.17.
>> ENOSPC on what operation? write? open(O_CREAT)? something else?
>
> Unknown.
>
>> What's the filesystem config (xfs_info output)?
>
> (restored from metadata dump)
>
> meta-data=/dev/loop2             isize=512    agcount=32, agsize=14494720 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=0 spinodes=0 rmapbt=0
>          =                       reflink=0
> data     =                       bsize=4096   blocks=463831040, imaxpct=5
>          =                       sunit=256    swidth=256 blks

sunit=swidth is unusual for a RAID0 array, unless it's hardware RAID
and the array only reports one number to mkfs. Was this chosen by
mkfs, or specifically configured by the user? If specifically
configured, why?

What is important is that it means aligned allocations will be used
for any allocation that is over sunit (1MB) and that's where all the
problems seem to come from.

> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal               bsize=4096   blocks=226480, version=2
>          =                       sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
>> Has xfs_fsr been run on this filesystem regularly?
>
> xfs_fsr has never been run, until we saw the problem (and then did
> not fix it). IIUC the workload should be self-defragmenting: it
> consists of writing large files, then erasing them. I estimate that
> around 100 files are written concurrently (from 14 threads), and
> they are written with large extent hints. With every large file,
> another smaller (but still large) file is written, and a few
> smallish metadata files.

Do those smaller files get removed when the big files are removed?

> I understood from xfs_fsr that it attempts to defragment files, not
> free space, although that may come as a side effect. In any case I
> ran xfs_db after xfs_fsr and did not see an improvement.

xfs_fsr takes fragmented files and contiguous free space and turns it
into contiguous files and fragmented free space. You have fragmented
free space, so I needed to know if xfs_fsr was responsible for
that....

>> If the ENOSPC errors are only from files with 32MB extent size
>> hints on them, then it may be that there isn't sufficient contiguous
>> free space to allocate an entire 32MB extent. I'm not sure what the
>> allocator behaviour here is (the code is a maze of twisty passages),
>> so I'll have to look more into this.
>
> There are other files with 32MB hints that do not show the error
> (but on the other hand, the error has been observed few enough times
> for that to be a fluke).
*nod*

>> In the mean time, can you post the output of the freespace command
>> (both global and per-ag) so we can see just how much free space
>> there is and how badly fragmented it has become? I might be able to
>> reproduce the behaviour if I know the conditions under which it is
>> occurring.
>
> xfs_db> freesp
>    from      to  extents      blocks    pct
>       1       1     5916        5916   0.00
>       2       3    10235       22678   0.01
>       4       7    12251       66829   0.02
>       8      15     5521       59556   0.01
>      16      31     5703      132031   0.03
>      32      63     9754      463825   0.11
>      64     127    16742     1590339   0.37
>     128     255  1550511   390108625  89.87
>     256     511    71516    29178504   6.72
>     512    1023       19       15355   0.00
>    1024    2047      287      461824   0.11
>    2048    4095      528     1611413   0.37
>    4096    8191     1537    10352304   2.38
>    8192   16383        2       19015   0.00
>
> Just 2 extents >= 32MB (and they may have been freed after the error).

Yes, and the vast majority of free space is in lengths between 512kB
and 1020kB. This is what I'd expect if you have large, stripe aligned
allocations interleaved with smaller, sub-stripe unit allocations.

As an example of behaviour that can lead to this sort of free space
fragmentation, start with 10 stripe units of contiguous free space:

  0    1    2    3    4    5    6    7    8    9    10
  +----+----+----+----+----+----+----+----+----+----+----+

Now allocate a > stripe unit extent (say 2 units):

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLL+----+----+----+----+----+----+----+----+----+

Now allocate a small file A:

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLLAA---+----+----+----+----+----+----+----+----+

Now allocate another large extent:

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLLAA---LLLLLLLLLL+----+----+----+----+----+----+

After a while, a significant part of your filesystem looks like this
repeating pattern:

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLLAA---LLLLLLLLLLBB---LLLLLLLLLLCC---LLLLLLLLLLDD---+

i.e. there are lots of small, isolated sub stripe unit free spaces.

If you now start removing large extents but leaving the small files
behind, you end up with this:

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLLAA---+---------BB---LLLLLLLLLLCC---+----+----DD---+

And now we go to allocate a new large+small file pair (M+n), they'll
get laid out like this:

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLLAA---MMMMMMMMMMBB---LLLLLLLLLLCC---nn---+----DD---+

See how we lost a large aligned 2MB freespace @ 9 when the small file
"nn" was laid down? Repeat this fill and free pattern over and over
again, and eventually it fragments the free space until there's no
large contiguous free space left, and large aligned extents can no
longer be allocated.

For this to trigger you need the small files to be larger than 1
stripe unit, but still much smaller than the extent size hint, and the
small files need to hang around as the large files come and go.

>>> Is this a known issue?

The effect and symptom is known - it's a generic large aligned extent
vs small unaligned extent issue, but I've never seen it manifest in a
user workload outside of a very constrained multistream realtime video
ingest/playout workload (i.e. the workload the filestreams allocator
was written for). And before you ask, no, the filestreams allocator
does not solve this problem.

The most common manifestation of this problem has been inode
allocation on filesystems full of small files - inodes are allocated
in large aligned extents compared to small files, and so eventually
the filesystem runs out of large contiguous freespace and inodes can't
be allocated. The sparse inodes mkfs option fixed this by allowing
inodes to be allocated as sparse chunks so they could interleave into
any free space available....

>>> Would upgrading the kernel help?
>> Not that I know of. If it's an extszhint vs free space fragmentation
>> issue, then a kernel upgrade is unlikely to fix it.

Upgrading the kernel won't fix it, because it's an extszhint vs free
space fragmentation issue.
Filesystems that get into this state are generally considered
unrecoverable. Well, you can recover them by deleting everything from
them to reform contiguous free space, but you may as well just mkfs
and restore from backup because it's much, much faster than waiting
for rm -rf....

And, really, I expect that a different filesystem geometry and/or
mount options are going to be needed to avoid getting into this state
again. However, I don't really know enough yet about what in the
workload and allocator is triggering the issue to say for sure.

Can I get access to the metadump to dig around in the filesystem
directly so I can see how everything has ended up laid out? That will
help me work out what is actually occurring and determine if mkfs/mount
options can address the problem or whether deeper allocator algorithm
changes may be necessary....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
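[Dave's fill-and-free pattern above is easy to model. A toy first-fit allocator, entirely my own construction and not XFS's code, that interleaves stripe-aligned large extents with small sub-stripe-unit files and then frees only the large ones:]

```python
# Toy model: a disk of 40 stripe units, tracked as a per-block free map.
SU = 4                                   # blocks per stripe unit
DISK = 40 * SU
free = [True] * DISK

def alloc(nblocks, aligned):
    """First-fit allocate; aligned requests start on a stripe unit."""
    step = SU if aligned else 1
    for start in range(0, DISK - nblocks + 1, step):
        if all(free[start:start + nblocks]):
            for b in range(start, start + nblocks):
                free[b] = False
            return start
    return None                          # "ENOSPC" for this request

# Interleave large aligned extents (2 SUs) with small files (SU + 1 blocks,
# i.e. slightly larger than one stripe unit, as in Dave's trigger condition).
large = []
while (x := alloc(2 * SU, aligned=True)) is not None:
    large.append(x)
    alloc(SU + 1, aligned=False)

# Free only the large extents, as when the big files are unlinked.
for x in large:
    for b in range(x, x + 2 * SU):
        free[b] = True

def largest_aligned():
    """Longest run of wholly-free stripe units, in blocks."""
    best = run = 0
    for su in range(0, DISK, SU):
        run = run + SU if all(free[su:su + SU]) else 0
        best = max(best, run)
    return best
```

[With these numbers, most of the disk ends up free after the large files are removed, yet the leftover small files cap every aligned free run at two stripe units - the same shape as the freesp histograms above, where ~90% of free space sits in sub-1MB extents.]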
* Re: ENSOPC on a 10% used disk
From: Avi Kivity @ 2018-10-18 11:00 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs

On 18/10/2018 13.05, Dave Chinner wrote:
> [ hmmm, there's some whacky utf-8 whitespace characters in the
> copy-n-pasted text... ]

It's a brave new world out there.

> On Thu, Oct 18, 2018 at 10:55:18AM +0300, Avi Kivity wrote:
>> On 18/10/2018 04.37, Dave Chinner wrote:
>>> On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
>>>> I have a user running a 1.7TB filesystem with ~10% usage (as shown
>>>> by df), getting sporadic ENOSPC errors. The disk is mounted with
>>>> inode64 and has a relatively small number of large files. The disk
>>>> is a single-member RAID0 array, with 1MB chunk size. There are 32
>
> Ok, now I need to know what "single member RAID0 array" means,
> because this is clearly related to allocation alignment and I need
> to know why the FS was configured the way it was.

It's a Linux RAID device, /dev/md0. We configure it this way so that
it's easy to add storage (okay, the real reason is probably to avoid
special-casing one drive).

> It's one disk? Or is it a hardware RAID0 array that presents as a
> single lun with a stripe width of 1MB? If so, how many disks are in
> it? Is the chunk size the stripe unit (per-disk chunk size) or the
> stripe width (all disks get hit by a 1MB IO)?
>
> Or something else?

One disk, organized into a Linux RAID device with just one member.

>>>> AGs. Running Linux 4.9.17.
>>> ENOSPC on what operation? write? open(O_CREAT)? something else?
>>
>> Unknown.
>>
>>> What's the filesystem config (xfs_info output)?
>>
>> (restored from metadata dump)
>>
>> meta-data=/dev/loop2             isize=512    agcount=32, agsize=14494720 blks
>>          =                       sectsz=512   attr=2, projid32bit=1
>>          =                       crc=1        finobt=0 spinodes=0 rmapbt=0
>>          =                       reflink=0
>> data     =                       bsize=4096   blocks=463831040, imaxpct=5
>>          =                       sunit=256    swidth=256 blks
>
> sunit=swidth is unusual for a RAID0 array, unless it's hardware RAID
> and the array only reports one number to mkfs. Was this chosen by
> mkfs, or specifically configured by the user? If specifically
> configured, why?

I'm guessing it's because it has one member? I'm guessing the usual is
swidth=sunit*nmembers? Maybe that configuration confused xfs? Although
we've been using it on many instances.

> What is important is that it means aligned allocations will be used
> for any allocation that is over sunit (1MB) and that's where all the
> problems seem to come from.

Do these aligned allocations not fall back to non-aligned allocations
if they fail?

>> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
>> log      =internal               bsize=4096   blocks=226480, version=2
>>          =                       sectsz=512   sunit=8 blks, lazy-count=1
>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>
>>> Has xfs_fsr been run on this filesystem regularly?
>>
>> xfs_fsr has never been run, until we saw the problem (and then did
>> not fix it). IIUC the workload should be self-defragmenting: it
>> consists of writing large files, then erasing them. I estimate that
>> around 100 files are written concurrently (from 14 threads), and
>> they are written with large extent hints. With every large file,
>> another smaller (but still large) file is written, and a few
>> smallish metadata files.
>
> Do those smaller files get removed when the big files are removed?

Yes. It's more or less like this:

1. Create two big files, with 32MB hints.
2. Append to the two files, using 128k AIO/DIO writes. We truncate
   ahead so those writes are not size-changing.
3. Truncate those files to their final size, write ~5 much smaller
   files using the same pattern.
4. A bunch of fdatasyncs, renames, and directory fdatasyncs.
5. The two big files get random reads for a random while.
6. All files are unlinked (with some renames and directory fdatasyncs
   so we can recover if we crash while doing that).
7. Rinse, repeat.

The whole thing happens in parallel for similar and different file
sizes and lifetimes.

The commitlog files (for which we've seen the error) are simpler:
create a file with a 32MB extent hint, truncate to 32MB size, lots of
writes (which may not all be 128k).

>> I understood from xfs_fsr that it attempts to defragment files, not
>> free space, although that may come as a side effect. In any case I
>> ran xfs_db after xfs_fsr and did not see an improvement.
>
> xfs_fsr takes fragmented files and contiguous free space and turns
> it into contiguous files and fragmented free space. You have
> fragmented free space, so I needed to know if xfs_fsr was
> responsible for that....

I see.

>>> If the ENOSPC errors are only from files with 32MB extent size
>>> hints on them, then it may be that there isn't sufficient contiguous
>>> free space to allocate an entire 32MB extent. I'm not sure what the
>>> allocator behaviour here is (the code is a maze of twisty passages),
>>> so I'll have to look more into this.
>> There are other files with 32MB hints that do not show the error
>> (but on the other hand, the error has been observed few enough times
>> for that to be a fluke).
> *nod*
>
>>> In the mean time, can you post the output of the freespace command
>>> (both global and per-ag) so we can see just how much free space
>>> there is and how badly fragmented it has become? I might be able to
>>> reproduce the behaviour if I know the conditions under which it is
>>> occurring.
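[The commitlog pattern described above - truncate ahead, then non-size-changing writes at the stated 128k granularity - can be sketched as follows. This is a plain-Python illustration only: it omits AIO/O_DIRECT and the extent size hint, and the function name is mine:]

```python
import os

CHUNK = 128 * 1024                   # the 128k write size from the workload
SIZE = 32 * 1024 * 1024              # commitlog target size

def write_commitlog(path):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.ftruncate(fd, SIZE)       # truncate ahead: the writes below
                                     # never change the file size
        buf = b"\0" * CHUNK
        for off in range(0, SIZE, CHUNK):
            os.pwrite(fd, buf, off)  # non-size-changing write
        os.fdatasync(fd)             # durability point, as in step 4 above
    finally:
        os.close(fd)
```

[The point of truncating ahead is that each write then updates only data, not i_size, so the common path avoids a size-update transaction.]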
>>
>> xfs_db> freesp
>>    from      to extents    blocks   pct
>>       1       1    5916      5916  0.00
>>       2       3   10235     22678  0.01
>>       4       7   12251     66829  0.02
>>       8      15    5521     59556  0.01
>>      16      31    5703    132031  0.03
>>      32      63    9754    463825  0.11
>>      64     127   16742   1590339  0.37
>>     128     255 1550511 390108625 89.87
>>     256     511   71516  29178504  6.72
>>     512    1023      19     15355  0.00
>>    1024    2047     287    461824  0.11
>>    2048    4095     528   1611413  0.37
>>    4096    8191    1537  10352304  2.38
>>    8192   16383       2     19015  0.00
>>
>> Just 2 extents >= 32MB (and they may have been freed after the error).
> Yes, and the vast majority of free space is in lengths between 512kB
> and 1020kB. This is what I'd expect if you have large, stripe
> aligned allocations interleaved with smaller, sub-stripe unit
> allocations.
>
> As an example of behaviour that can lead to this sort of free space
> fragmentation, start with 10 stripe units of contiguous free space:
>
> 0    1    2    3    4    5    6    7    8    9    10
> +----+----+----+----+----+----+----+----+----+----+----+
>
> Now allocate a > stripe unit extent (say 2 units):
>
> 0    1    2    3    4    5    6    7    8    9    10
> LLLLLLLLLL+----+----+----+----+----+----+----+----+----+
>
> Now allocate a small file A:
>
> 0    1    2    3    4    5    6    7    8    9    10
> LLLLLLLLLLAA---+----+----+----+----+----+----+----+----+
>
> Now allocate another large extent:
>
> 0    1    2    3    4    5    6    7    8    9    10
> LLLLLLLLLLAA---LLLLLLLLLL+----+----+----+----+----+----+
>
> After a while, a significant part of your filesystem looks like
> this repeating pattern:
>
> 0    1    2    3    4    5    6    7    8    9    10
> LLLLLLLLLLAA---LLLLLLLLLLBB---LLLLLLLLLLCC---LLLLLLLLLLDD---+
>
> i.e. there are lots of small, isolated sub stripe unit free spaces.
> If you now start removing large extents but leaving the small
> files behind, you end up with this:
>
> 0    1    2    3    4    5    6    7    8    9    10
> LLLLLLLLLLAA---+---------BB---LLLLLLLLLLCC---+----+----DD---+
>
> And now when we go to allocate a new large+small file pair (M+n),
> they'll get laid out like this:
>
> 0    1    2    3    4    5    6    7    8    9    10
> LLLLLLLLLLAA---MMMMMMMMMMBB---LLLLLLLLLLCC---nn---+----DD---+
>
> See how we lost a large aligned 2MB freespace @ 9 when the small
> file "nn" was laid down? Repeat this fill and free pattern over and
> over again, and eventually it fragments the free space until there's
> no large contiguous free spaces left, and large aligned extents can
> no longer be allocated.
>
> For this to trigger you need the small files to be larger than 1
> stripe unit, but still much smaller than the extent size hint, and
> the small files need to hang around as the large files come and go.

This can happen, and indeed I see our default hint is 1MB, so our
small files use a 1MB hint. Looks like we should remove that 1MB hint,
since it's reducing allocation flexibility for XFS without a good
return.

On the other hand, I worry that because we bypass the page cache, XFS
doesn't get to see the entire file at one time and so it will get
fragmented. Suppose I write a 4k file with a 1MB hint. How is that
trailing (1MB-4k) marked? Free extent, free extent with extra
annotation, or allocated extent? We may need to deallocate those
extents? (will FALLOC_FL_PUNCH_HOLE do the trick?)

> >>>> Is this a known issue?
> The effect and symptom is - it's a generic large aligned extent vs
> small unaligned extent issue, but I've never seen it manifest in a
> user workload outside of a very constrained multistream realtime
> video ingest/playout workload (i.e. the workload the filestreams
> allocator was written for). And before you ask, no, the filestreams
> allocator does not solve this problem.
>
> The most common manifestation of this problem has been inode
> allocation on filesystems full of small files - inodes are allocated
> in large aligned extents compared to small files, and so eventually
> the filesystem runs out of large contiguous freespace and inodes
> can't be allocated. The sparse inodes mkfs option fixed this by
> allowing inodes to be allocated as sparse chunks so they could
> interleave into any free space available....

Shouldn't XFS fall back to a non-aligned allocation rather than
returning ENOSPC on a filesystem with 90% free space?

>
>>>> Would upgrading the kernel help?
>>> Not that I know of. If it's an extszhint vs free space fragmentation
>>> issue, then a kernel upgrade is unlikely to fix it.
> Upgrading the kernel won't fix it, because it's an extszhint vs free
> space fragmentation issue.
>
> Filesystems that get into this state are generally considered
> unrecoverable. Well, you can recover them by deleting everything
> from them to reform contiguous free space, but you may as well just
> mkfs and restore from backup because it's much, much faster than
> waiting for rm -rf....
>
> And, really, I expect that a different filesystem geometry and/or
> mount options are going to be needed to avoid getting into this
> state again. However, I don't really know enough yet about what in
> the workload and allocator is triggering the issue to say
> yet.
>
> Can I get access to the metadump to dig around in the filesystem
> directly so I can see how everything has ended up laid out? That
> will help me work out what is actually occurring and determine if
> mkfs/mount options can address the problem or whether deeper
> allocator algorithm changes may be necessary....

I will ask permission to share the dump.

Thanks a lot for all the explanations and help.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk 2018-10-18 11:00 ` Avi Kivity @ 2018-10-18 13:36 ` Avi Kivity 2018-10-19 7:51 ` Dave Chinner 2018-10-18 15:44 ` Avi Kivity 2018-10-19 1:15 ` Dave Chinner 2 siblings, 1 reply; 26+ messages in thread From: Avi Kivity @ 2018-10-18 13:36 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On 18/10/2018 14.00, Avi Kivity wrote: > >> Can I get access to the metadump to dig around in the filesystem >> directly so I can see how everything has ended up laid out? that >> will help me work out what is actually occurring and determine if >> mkfs/mount options can address the problem or whether deeper >> allocator algorithm changes may be necessary.... > > > I will ask permission to share the dump. > > > I'll send you a link privately. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk 2018-10-18 13:36 ` Avi Kivity @ 2018-10-19 7:51 ` Dave Chinner 2018-10-21 8:55 ` Avi Kivity 0 siblings, 1 reply; 26+ messages in thread From: Dave Chinner @ 2018-10-19 7:51 UTC (permalink / raw) To: Avi Kivity; +Cc: linux-xfs On Thu, Oct 18, 2018 at 04:36:42PM +0300, Avi Kivity wrote:
> On 18/10/2018 14.00, Avi Kivity wrote:
> >>Can I get access to the metadump to dig around in the filesystem
> >>directly so I can see how everything has ended up laid out? that
> >>will help me work out what is actually occurring and determine if
> >>mkfs/mount options can address the problem or whether deeper
> >>allocator algorithm changes may be necessary....
> >
> >I will ask permission to share the dump.
>
> I'll send you a link privately.

Thanks - I've started looking at this - the information here is
just layout stuff - I've omitted filenames and anything else that
might be identifying from the output.

Looking at a commit log file:

stat.size = 33554432
stat.blocks = 34720
fsxattr.xflags = 0x800 [----------e-----]
fsxattr.projid = 0
fsxattr.extsize = 33554432
fsxattr.cowextsize = 0
fsxattr.nextents = 14

and the layout:

EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET             TOTAL FLAGS
  0: [0..4079]:       2646677520..2646681599 22 (95606800..95610879)   4080 001010
  1: [4080..8159]:    2643130384..2643134463 22 (92059664..92063743)   4080 001010
  2: [8160..12239]:   2642124816..2642128895 22 (91054096..91058175)   4080 001010
  3: [12240..16319]:  2640666640..2640670719 22 (89595920..89599999)   4080 001010
  4: [16320..18367]:  2640523264..2640525311 22 (89452544..89454591)   2048 000000
  5: [18368..20415]:  2640119808..2640121855 22 (89049088..89051135)   2048 000000
  6: [20416..21287]:  2639874064..2639874935 22 (88803344..88804215)    872 001111
  7: [21288..21295]:  2639874936..2639874943 22 (88804216..88804223)      8 011111
  8: [21296..24495]:  2639874944..2639878143 22 (88804224..88807423)   3200 001010
  9: [24496..26543]:  2639427584..2639429631 22 (88356864..88358911)   2048 000000
 10: [26544..28591]:  2638981120..2638983167 22 (87910400..87912447)   2048 000000
 11: [28592..30639]:  2638770176..2638772223 22 (87699456..87701503)   2048 000000
 12: [30640..31279]:  2638247952..2638248591 22 (87177232..87177871)    640 001111
 13: [31280..34719]:  2638248592..2638252031 22 (87177872..87181311)   3440 011010
 14: [34720..65535]:  hole                                            30816

The first thing I note is that the initial allocations are just short
of 2MB, and so the extent size hint is, indeed, being truncated here
according to contiguous free space limitations. I had thought that
should occur from reading the code, but it's complex and I wasn't
100% certain what minimum allocation length would be used.

Looking at the system batchlog files, I'm guessing the filesystem
ran out of contiguous 32MB free space extents some time around
September 25. The *Data.db files from 24 Sep and earlier are all
nice 32MB extents; from 25 Sep onwards they never make the full
32MB (30-31MB max). eg, good:

EXT: FILE-OFFSET       BLOCK-RANGE          AG AG-OFFSET            TOTAL FLAGS
  0: [0..65535]:       350524552..350590087  3 (2651272..2716807)   65536 001111
  1: [65536..131071]:  353378024..353443559  3 (5504744..5570279)   65536 001111
  2: [131072..196607]: 355147016..355212551  3 (7273736..7339271)   65536 001111
  3: [196608..262143]: 360029416..360094951  3 (12156136..12221671) 65536 001111
  4: [262144..327679]: 362244144..362309679  3 (14370864..14436399) 65536 001111
  5: [327680..343415]: 365809456..365825191  3 (17936176..17951911) 15736 001111

bad:

EXT: FILE-OFFSET       BLOCK-RANGE          AG AG-OFFSET            TOTAL FLAGS
  0: [0..64127]:       512855496..512919623  4 (49024456..49088583) 64128 001111
  1: [64128..128247]:  266567048..266631167  2 (34651528..34715647) 64120 001010
  2: [128248..142327]: 264401888..264415967  2 (32486368..32500447) 14080 001111

Hmmm - there's 2 million files in this filesystem. That is quite a
lot...

Ok... I see where all the files are - there's a db that was
snapshotted every half hour going back to December 19 2017.
There's 55GB of snapshot data there - 14362 snapshots holding
1.8 million files.

Ok, now I understand how the filesystem got into this mess. It has
nothing really to do with the filesystem allocator, geometry, extent
size hints, etc. It isn't really even an XFS specific problem - I
think most filesystems would be in trouble if you did this to them.

First, let me demonstrate that the freespace fragmentation is caused
by these snapshots by removing them all:

before:
   from      to extents    blocks   pct
      1       1    5916      5916  0.00
      2       3   10235     22678  0.01
      4       7   12251     66829  0.02
      8      15    5521     59556  0.01
     16      31    5703    132031  0.03
     32      63    9754    463825  0.11
     64     127   16742   1590339  0.37
    128     255 1550511 390108625 89.87
    256     511   71516  29178504  6.72
    512    1023      19     15355  0.00
   1024    2047     287    461824  0.11
   2048    4095     528   1611413  0.37
   4096    8191    1537  10352304  2.38
   8192   16383       2     19015  0.00

Run a delete:

for d in snapshots/*; do
	rm -rf $d &
done

<cranking along at ~12,000 write iops>

# uptime
 17:41:08 up 22:07,  1 user,  load average: 14293.17, 13840.37, 9517.14
#

500,000 files removed:
   from      to extents    blocks   pct
     64     127   22564   2054234  0.47
    128     255  900480 226428059 51.43
    256     511  189904  91033237 20.68
    512    1023   68304  54958788 12.48
   1024    2047   25187  38284024  8.70
   2048    4095    5508  15204528  3.45
   4096    8191    1665  10999789  2.50
   8192   16383      15    139424  0.03

1m files removed:
   from      to extents    blocks   pct
     64     127   21940   1991685  0.45
    128     255  536985 134731402 30.35
    256     511  152092  73465972 16.55
    512    1023  100471  82971130 18.69
   1024    2047   48519  74016490 16.67
   2048    4095   17272  49209538 11.09
   4096    8191    4307  25135374  5.66
   8192   16383     135   1254037  0.28

1.5m files removed:
   from      to extents    blocks   pct
     64     127    9851    924782  0.20
    128     255  227945  57079302 12.32
    256     511   38723  18129086  3.91
    512    1023   33547  28027554  6.05
   1024    2047   31904  50171699 10.83
   2048    4095   25263  75381887 16.27
   4096    8191   16885 102836365 22.19
   8192   16383    6367  68809645 14.85
  16384   32767    1862  40183775  8.67
  32768   65535     385  16228869  3.50
  65536  131071      51   4213237  0.91
 131072  262143       6    958528  0.21

after:
   from      to extents    blocks   pct
    128     255  154063  38785829  8.64
    256     511   11037   4942114  1.10
    512    1023    8576   6930035  1.54
   1024    2047    8496  13464298  3.00
   2048    4095    7664  23034455  5.13
   4096    8191    8497  55217061 12.31
   8192   16383    4233  45867691 10.22
  16384   32767    1533  33488995  7.46
  32768   65535     520  23924895  5.33
  65536  131071     305  28675646  6.39
 131072  262143     230  42411732  9.45
 262144  524287      98  37213190  8.29
 524288 1048575      41  29163579  6.50
1048576 2097151      27  40502889  9.03
2097152 4194303       5  14576157  3.25
4194304 8388607       2  10005670  2.23

Ok, so the result is not perfect, but there are now huge contiguous
free space extents available again - ~70% of the free space is now
contiguous extents >=32MB in length. There's every chance that the
fs would continue to help reform large contiguous free spaces as the
database files come and go now, as long as the snapshot problem is
dealt with.

So, what's the problem? Well, it's simply that the workload is
mixing data with vastly different temporal characteristics in the
same physical locality. Every half an hour, a set of ~100 smallish
files are written into a new directory which lands them at the low
end of the largest free space extent in that AG. Each new snapshot
directory ends up in a different AG, so it slowly spreads the
snapshots across all the AGs in the filesystem.

Each snapshot effectively appends to the current working area in the
AG, chopping it out of the largest contiguous free space. By the
time the next snapshot in that AG comes around, there's other new
short term data between the old snapshot and the new one. The new
snapshot chops up the largest freespace, and on goes the cycle.

Eventually the short term data between the snapshots gets removed,
but this doesn't reform large contiguous free spaces because the
snapshot data is in the way. And so this cycle continues with the
snapshot data chopping up the largest freespace extents in the
filesystem until there are no more large free space extents to be
found.
The solution is to manage the snapshot data better. We need to keep
all the long term data physically isolated from the short term data
so they don't fragment free space. A short term application level
solution would require migrating the snapshot data out of the
filesystem to somewhere else and pointing to it with symlinks.

From the filesystem POV, I'm not sure that there is much we can do
about this directly - we have no idea what the lifetime of the data
is going to be....

<ding>

Hold on....

<rummage in code>

....we already have an interface for setting those sorts of hints:

fcntl(F_SET_RW_HINT, rw_hint)

/*
 * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
 * used to clear any hints previously set.
 */
#define RWF_WRITE_LIFE_NOT_SET	0
#define RWH_WRITE_LIFE_NONE	1
#define RWH_WRITE_LIFE_SHORT	2
#define RWH_WRITE_LIFE_MEDIUM	3
#define RWH_WRITE_LIFE_LONG	4
#define RWH_WRITE_LIFE_EXTREME	5

Avi, does this sound like something that you could use to
classify the different types of data the database writes out?

I'll need to have a think about how to apply this to the allocator
policy algorithms before going any further, but I suspect making use
of this hint interface will allow us to prevent interleaving of short
and long term data and so avoid the freespace fragmentation it is
causing here....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk 2018-10-19 7:51 ` Dave Chinner @ 2018-10-21 8:55 ` Avi Kivity 2018-10-21 14:28 ` Dave Chinner 0 siblings, 1 reply; 26+ messages in thread From: Avi Kivity @ 2018-10-21 8:55 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On 19/10/2018 10.51, Dave Chinner wrote: > On Thu, Oct 18, 2018 at 04:36:42PM +0300, Avi Kivity wrote: >> On 18/10/2018 14.00, Avi Kivity wrote: >>>> Can I get access to the metadump to dig around in the filesystem >>>> directly so I can see how everything has ended up laid out? that >>>> will help me work out what is actually occurring and determine if >>>> mkfs/mount options can address the problem or whether deeper >>>> allocator algorithm changes may be necessary.... >>> I will ask permission to share the dump. >> I'll send you a link privately. > Thanks - I've started looking at this - the information here is > just layout stuff - I'm omitted filenames and anything else that > might be identifying from the output. > > Looking at a commit log file: > > stat.size = 33554432 > stat.blocks = 34720 > fsxattr.xflags = 0x800 [----------e-----] > fsxattr.projid = 0 > fsxattr.extsize = 33554432 > fsxattr.cowextsize = 0 > fsxattr.nextents = 14 > > > and the layout: > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > 0: [0..4079]: 2646677520..2646681599 22 (95606800..95610879) 4080 001010 > 1: [4080..8159]: 2643130384..2643134463 22 (92059664..92063743) 4080 001010 > 2: [8160..12239]: 2642124816..2642128895 22 (91054096..91058175) 4080 001010 > 3: [12240..16319]: 2640666640..2640670719 22 (89595920..89599999) 4080 001010 > 4: [16320..18367]: 2640523264..2640525311 22 (89452544..89454591) 2048 000000 > 5: [18368..20415]: 2640119808..2640121855 22 (89049088..89051135) 2048 000000 > 6: [20416..21287]: 2639874064..2639874935 22 (88803344..88804215) 872 001111 > 7: [21288..21295]: 2639874936..2639874943 22 (88804216..88804223) 8 011111 > 8: [21296..24495]: 2639874944..2639878143 22 (88804224..88807423) 3200 
001010 > 9: [24496..26543]: 2639427584..2639429631 22 (88356864..88358911) 2048 000000 > 10: [26544..28591]: 2638981120..2638983167 22 (87910400..87912447) 2048 000000 > 11: [28592..30639]: 2638770176..2638772223 22 (87699456..87701503) 2048 000000 > 12: [30640..31279]: 2638247952..2638248591 22 (87177232..87177871) 640 001111 > 13: [31280..34719]: 2638248592..2638252031 22 (87177872..87181311) 3440 011010 > 14: [34720..65535]: hole 30816 > > The first thing I note is the initial allocations are just short of > 2MB and so the extent size hint is, indeed, being truncated here > according to contiguous free space limitations. I had thought that > should occur from reading the code, but it's complex and I wasn't > 100% certain what minimum allocation length would be used. > > Looking at the system batchlog files, I'm guessing the filesystem > ran out of contiguous 32MB free space extents some time around > September 25. The *Data.db files from 24 Sep and earlier then are > all nice 32MB extents, from 25 sep onwards they never make the full > 32MB (30-31MB max). eg, good: > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > 0: [0..65535]: 350524552..350590087 3 (2651272..2716807) 65536 001111 > 1: [65536..131071]: 353378024..353443559 3 (5504744..5570279) 65536 001111 > 2: [131072..196607]: 355147016..355212551 3 (7273736..7339271) 65536 001111 > 3: [196608..262143]: 360029416..360094951 3 (12156136..12221671) 65536 001111 > 4: [262144..327679]: 362244144..362309679 3 (14370864..14436399) 65536 001111 > 5: [327680..343415]: 365809456..365825191 3 (17936176..17951911) 15736 001111 > > bad: > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > 0: [0..64127]: 512855496..512919623 4 (49024456..49088583) 64128 001111 > 1: [64128..128247]: 266567048..266631167 2 (34651528..34715647) 64120 001010 > 2: [128248..142327]: 264401888..264415967 2 (32486368..32500447) 14080 001111 > So extent size is a hint but the extent alignment is a hard requirement. 
Since eventually the ENOSPC happened due to the alignment restriction, I think the alignment requirement should be made a hint too. > Hmmm - there's 2 million files in this filesystem. that is quite a > lot... > > Ok... I see where all the files are - there's a db that was > snapshotted every half hour going back to December 19 2017. There's > 55GB of snapshot data there 14362 snapshots holding in 1.8million > files. > > Ok, now I understand how the filesystem got into this mess. It has > nothing really to do with the filesystem allocator, geometry, extent > size hints, etc. It isn't really even an XFS specific problem - I > think most filesystems would be in trouble if you did this to them. Well, if you create snapshots and never delete them you'd run into a real ENOSPC sooner or later, so the main problem was lack of snapshot hygiene. But it did trigger a premature ENOSPC due to the alignment restriction on those small files with hints (which I'm going to remove). > > First, let me demonstrate that the freespace fragmentation is caused > by these snapshots by removing them all: > > before: > from to extents blocks pct > 1 1 5916 5916 0.00 > 2 3 10235 22678 0.01 > 4 7 12251 66829 0.02 > 8 15 5521 59556 0.01 > 16 31 5703 132031 0.03 > 32 63 9754 463825 0.11 > 64 127 16742 1590339 0.37 > 128 255 1550511 390108625 89.87 > 256 511 71516 29178504 6.72 > 512 1023 19 15355 0.00 > 1024 2047 287 461824 0.11 > 2048 4095 528 1611413 0.37 > 4096 8191 1537 10352304 2.38 > 8192 16383 2 19015 0.00 > > Run a delete: > > for d in snapshots/*; do > rm -rf $d & > done > > <cranking along at ~12,000 write iops> > > # uptime > 17:41:08 up 22:07, 1 user, load average: 14293.17, 13840.37, 9517.14 > # > > 500,000 files removed: > from to extents blocks pct > 64 127 22564 2054234 0.47 > 128 255 900480 226428059 51.43 > 256 511 189904 91033237 20.68 > 512 1023 68304 54958788 12.48 > 1024 2047 25187 38284024 8.70 > 2048 4095 5508 15204528 3.45 > 4096 8191 1665 10999789 2.50 > 8192 16383 15 
139424 0.03 > > 1m files removed: > from to extents blocks pct > 64 127 21940 1991685 0.45 > 128 255 536985 134731402 30.35 > 256 511 152092 73465972 16.55 > 512 1023 100471 82971130 18.69 > 1024 2047 48519 74016490 16.67 > 2048 4095 17272 49209538 11.09 > 4096 8191 4307 25135374 5.66 > 8192 16383 135 1254037 0.28 > > 1.5m files removed: > from to extents blocks pct > 64 127 9851 924782 0.20 > 128 255 227945 57079302 12.32 > 256 511 38723 18129086 3.91 > 512 1023 33547 28027554 6.05 > 1024 2047 31904 50171699 10.83 > 2048 4095 25263 75381887 16.27 > 4096 8191 16885 102836365 22.19 > 8192 16383 6367 68809645 14.85 > 16384 32767 1862 40183775 8.67 > 32768 65535 385 16228869 3.50 > 65536 131071 51 4213237 0.91 > 131072 262143 6 958528 0.21 > > after: > from to extents blocks pct > 128 255 154063 38785829 8.64 > 256 511 11037 4942114 1.10 > 512 1023 8576 6930035 1.54 > 1024 2047 8496 13464298 3.00 > 2048 4095 7664 23034455 5.13 > 4096 8191 8497 55217061 12.31 > 8192 16383 4233 45867691 10.22 > 16384 32767 1533 33488995 7.46 > 32768 65535 520 23924895 5.33 > 65536 131071 305 28675646 6.39 > 131072 262143 230 42411732 9.45 > 262144 524287 98 37213190 8.29 > 524288 1048575 41 29163579 6.50 > 1048576 2097151 27 40502889 9.03 > 2097152 4194303 5 14576157 3.25 > 4194304 8388607 2 10005670 2.23 > > Ok, so the results is not perfect, but there are now huge contiguous > free space extents available again - ~70% of the free space is now > contiguous extents >=32MB in length. There's every chance that the > fs would confinue to help reform large contiguous free spaces as the > database files come and go now, as long as the snapshot problem is > dealt with. > > So, what's the problem? Well, it's simply that the workload is > mixing data with vastly different temporal characteristics in the > same physical locality. Every half an hour, a set of ~100 smallish > files are written into a new directory which lands them at the low > endof the largest free space extent in that AG. 
Each new snapshot > directory ends up in a different AG, so it slowly spreads the > snapshots across all the AGs in the filesystem. Not exactly - those snapshots are hard links into the live database files, which eventually get removed. Usually, small files get removed early, but with the snapshots they get to live forever. > Each snapshot effective appends to the current working area in the > AG, chopping it out of the largest contiguous free space. By the > time the next snapshot in that AG comes around, there's other new > short term data between the old snapshot and the new one. The new > snapshot chops up the largest freespace, and on goes the cycle. > > Eventually the short term data between the snapshots gets removed, > but this doesn't reform large contiguous free spaces because the > snapshot data is in the way. And so this cycle continues with the > snapshot data chopping up the largest freespace extents in the > filesystem until there's not more large free space extents to be > found. > > The solution is to manage the snapshot data better. We need to keep > all the long term data physically isolated from the short term data > so they don't fragment free space. A short term application level > solution would require migrating the snapshot data out of the > filesystem to somewhere else and point to it with symlinks. Snapshots should not live forever on the disk. The procedure is to create a snapshot, copy it away, and then delete the snapshot. It's okay to let snapshots live for a while, but not all of them and not without a bound on their lifetime. The filesystem did have a role in this, by requiring alignment of the extent to the RAID stripe size. Now, given that this was a RAID with one member, alignment is pointless, but most of our deployments are to RAID arrays with >1 members, and alignment does save 12.5% of IOPS compared to un-aligned extents for compactions and writes (our scans/writes use 128k buffers, and the alignment is to 1MB). 
The database caused the problem by indirectly requiring 1MB alignment for files that are much smaller than 1MB, and the user contributed to the problem by causing millions of such small files to be kept. > > From the filesystem POV, I'm not sure that there is much we can do > about this directly - we have no idea what the lifetime of the data > is going to be.... > > <ding> > > Hold on.... > > <rummage in code> > > ....we already have an interface so setting those sorts of hints. > > fcntl(F_SET_RW_HINT, rw_hint) > > /* > * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be > * used to clear any hints previously set. > */ > #define RWF_WRITE_LIFE_NOT_SET 0 > #define RWH_WRITE_LIFE_NONE 1 > #define RWH_WRITE_LIFE_SHORT 2 > #define RWH_WRITE_LIFE_MEDIUM 3 > #define RWH_WRITE_LIFE_LONG 4 > #define RWH_WRITE_LIFE_EXTREME 5 > > Avi, does this sound like something that you could use to > classify the different types of data the data base writes out? So long as the penalty for a mis-classification is not too large, we can for sure. Commitlog files have short lifespan, and so do newly born small data files. Those small data files are compacted into increasingly larger and long-lived files, and this information is known at the time of creation. Even without the filesystem altering its allocation according to the hint, this is still useful, since the disk will alter its internal allocation and maybe do something useful with it (as long as the filesystem passes the hint to the disk). > > I'll need to have a think about how to apply this to the allocator > policy algorithms before going any further, but I suspect making use > of this hint interface will allow us prevent interleaving of short > and long term data so avoid the freespace fragmentation it is > causing here.... IIUC, the problem (of having ENOSPC on a 10% used disk) is not fragmentation per se, it's the alignment requirement. 
To take it to the extreme, a 1TB disk can only hold a million files if
those files must be aligned to 1MB, even if everything is perfectly
laid out. For sure fragmentation would have degraded performance
sooner or later, but that's not as bad as that ENOSPC.

I'm addressing the ENOSPC by removing the extent allocation hint on
files that are known to be small (and increasing their application
buffer sizes). In fact that will increase fragmentation, as the
filesystem will allocate one extent per buffer, rather than one extent
for the entire file. But I think that, given that the extent size is
treated as a hint (or so I infer from the fact that we have <32MB
extents), the alignment should be treated as a hint too. Perhaps
allocation with a hint should be performed in two passes, first trying
to match size and alignment, and then relaxing both restrictions.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk 2018-10-21 8:55 ` Avi Kivity @ 2018-10-21 14:28 ` Dave Chinner 2018-10-22 8:35 ` Avi Kivity 0 siblings, 1 reply; 26+ messages in thread From: Dave Chinner @ 2018-10-21 14:28 UTC (permalink / raw) To: Avi Kivity; +Cc: linux-xfs On Sun, Oct 21, 2018 at 11:55:47AM +0300, Avi Kivity wrote: > > On 19/10/2018 10.51, Dave Chinner wrote: > >On Thu, Oct 18, 2018 at 04:36:42PM +0300, Avi Kivity wrote: > >>On 18/10/2018 14.00, Avi Kivity wrote: > >>>>Can I get access to the metadump to dig around in the filesystem > >>>>directly so I can see how everything has ended up laid out? that > >>>>will help me work out what is actually occurring and determine if > >>>>mkfs/mount options can address the problem or whether deeper > >>>>allocator algorithm changes may be necessary.... > >>>I will ask permission to share the dump. > >>I'll send you a link privately. > >Thanks - I've started looking at this - the information here is > >just layout stuff - I'm omitted filenames and anything else that > >might be identifying from the output. 
> > > >Looking at a commit log file: > > > >stat.size = 33554432 > >stat.blocks = 34720 > >fsxattr.xflags = 0x800 [----------e-----] > >fsxattr.projid = 0 > >fsxattr.extsize = 33554432 > >fsxattr.cowextsize = 0 > >fsxattr.nextents = 14 > > > > > >and the layout: > > > >EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > > 0: [0..4079]: 2646677520..2646681599 22 (95606800..95610879) 4080 001010 > > 1: [4080..8159]: 2643130384..2643134463 22 (92059664..92063743) 4080 001010 > > 2: [8160..12239]: 2642124816..2642128895 22 (91054096..91058175) 4080 001010 > > 3: [12240..16319]: 2640666640..2640670719 22 (89595920..89599999) 4080 001010 > > 4: [16320..18367]: 2640523264..2640525311 22 (89452544..89454591) 2048 000000 > > 5: [18368..20415]: 2640119808..2640121855 22 (89049088..89051135) 2048 000000 > > 6: [20416..21287]: 2639874064..2639874935 22 (88803344..88804215) 872 001111 > > 7: [21288..21295]: 2639874936..2639874943 22 (88804216..88804223) 8 011111 > > 8: [21296..24495]: 2639874944..2639878143 22 (88804224..88807423) 3200 001010 > > 9: [24496..26543]: 2639427584..2639429631 22 (88356864..88358911) 2048 000000 > > 10: [26544..28591]: 2638981120..2638983167 22 (87910400..87912447) 2048 000000 > > 11: [28592..30639]: 2638770176..2638772223 22 (87699456..87701503) 2048 000000 > > 12: [30640..31279]: 2638247952..2638248591 22 (87177232..87177871) 640 001111 > > 13: [31280..34719]: 2638248592..2638252031 22 (87177872..87181311) 3440 011010 > > 14: [34720..65535]: hole 30816 > > > >The first thing I note is the initial allocations are just short of > >2MB and so the extent size hint is, indeed, being truncated here > >according to contiguous free space limitations. I had thought that > >should occur from reading the code, but it's complex and I wasn't > >100% certain what minimum allocation length would be used. > > > >Looking at the system batchlog files, I'm guessing the filesystem > >ran out of contiguous 32MB free space extents some time around > >September 25. 
The *Data.db files from 24 Sep and earlier then are > >all nice 32MB extents, from 25 sep onwards they never make the full > >32MB (30-31MB max). eg, good: > > > > EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > > 0: [0..65535]: 350524552..350590087 3 (2651272..2716807) 65536 001111 > > 1: [65536..131071]: 353378024..353443559 3 (5504744..5570279) 65536 001111 > > 2: [131072..196607]: 355147016..355212551 3 (7273736..7339271) 65536 001111 > > 3: [196608..262143]: 360029416..360094951 3 (12156136..12221671) 65536 001111 > > 4: [262144..327679]: 362244144..362309679 3 (14370864..14436399) 65536 001111 > > 5: [327680..343415]: 365809456..365825191 3 (17936176..17951911) 15736 001111 > > > >bad: > > > >EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS > > 0: [0..64127]: 512855496..512919623 4 (49024456..49088583) 64128 001111 > > 1: [64128..128247]: 266567048..266631167 2 (34651528..34715647) 64120 001010 > > 2: [128248..142327]: 264401888..264415967 2 (32486368..32500447) 14080 001111 > > > So extent size is a hint but the extent alignment is a hard > requirement.

No, physical alignment is being ignored here, too. Those flags on
the end?

FLAG Values:
    0100000 Shared extent
    0010000 Unwritten preallocated extent
    0001000 Doesn't begin on stripe unit
    0000100 Doesn't end on stripe unit
    0000010 Doesn't begin on stripe width
    0000001 Doesn't end on stripe width

When you have 001111, the allocation was completely unaligned.
When you have 001010, the tail is stripe aligned.
When you have 000000, the head and tail are stripe aligned.

As you can see, there is a mix of aligned, tail aligned and
completely unaligned extents. So, no, XFS is dropping both size
hints and alignment hints when it starts running out of aligned
contiguous free space extents.

> >Ok, so the results is not perfect, but there are now huge contiguous > >free space extents available again - ~70% of the free space is now > >contiguous extents >=32MB in length. 
> >There's every chance that the
> >fs would continue to help reform large contiguous free spaces as the
> >database files come and go now, as long as the snapshot problem is
> >dealt with.
> >
> >So, what's the problem? Well, it's simply that the workload is
> >mixing data with vastly different temporal characteristics in the
> >same physical locality. Every half an hour, a set of ~100 smallish
> >files are written into a new directory, which lands them at the low
> >end of the largest free space extent in that AG. Each new snapshot
> >directory ends up in a different AG, so it slowly spreads the
> >snapshots across all the AGs in the filesystem.

> Not exactly - those snapshots are hard links into the live database
> files, which eventually get removed. Usually, small files get
> removed early, but with the snapshots they get to live forever.

They might be created as hard links, but the effect when the
original database file links are removed is the same - the
snapshotted data lives forever, interleaved amongst short term data.

> >Each snapshot effectively appends to the current working area in the
> >AG, chopping it out of the largest contiguous free space. By the
> >time the next snapshot in that AG comes around, there's other new
> >short term data between the old snapshot and the new one. The new
> >snapshot chops up the largest freespace, and on goes the cycle.
> >
> >Eventually the short term data between the snapshots gets removed,
> >but this doesn't reform large contiguous free spaces because the
> >snapshot data is in the way. And so this cycle continues, with the
> >snapshot data chopping up the largest freespace extents in the
> >filesystem until there are no more large free space extents to be
> >found.
> >
> >The solution is to manage the snapshot data better. We need to keep
> >all the long term data physically isolated from the short term data
> >so they don't fragment free space.
> >A short term application level
> >solution would require migrating the snapshot data out of the
> >filesystem to somewhere else and pointing to it with symlinks.

> Snapshots should not live forever on the disk. The procedure is to
> create a snapshot, copy it away, and then delete the snapshot. It's
> okay to let snapshots live for a while, but not all of them and not
> without a bound on their lifetime.
>
> The filesystem did have a role in this, by requiring alignment of
> the extent to the RAID stripe size.

No, in the end it didn't.

> Now, given that this was a RAID
> with one member, alignment is pointless, but most of our deployments
> are to RAID arrays with >1 members, and alignment does save 12.5% of
> IOPS compared to un-aligned extents for compactions and writes (our
> scans/writes use 128k buffers, and the alignment is to 1MB). The
> database caused the problem by indirectly requiring 1MB alignment
> for files that are much smaller than 1MB, and the user contributed
> to the problem by causing millions of such small files to be kept.

*nod*

> ><ding>
> >
> >Hold on....
> >
> ><rummage in code>
> >
> >....we already have an interface for setting those sorts of hints:
> >
> >fcntl(F_SET_RW_HINT, rw_hint)
> >
> >/*
> > * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
> > * used to clear any hints previously set.
> > */
> >#define RWF_WRITE_LIFE_NOT_SET 0
> >#define RWH_WRITE_LIFE_NONE 1
> >#define RWH_WRITE_LIFE_SHORT 2
> >#define RWH_WRITE_LIFE_MEDIUM 3
> >#define RWH_WRITE_LIFE_LONG 4
> >#define RWH_WRITE_LIFE_EXTREME 5
> >
> >Avi, does this sound like something that you could use to
> >classify the different types of data the database writes out?

> So long as the penalty for a mis-classification is not too large, we
> can for sure.

OK.
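For concreteness, the hint interface mentioned above is a plain fcntl. Here is a minimal sketch (the helper name is illustrative, and the fallback #defines are only for older userspace headers that predate these constants; the values match the kernel's uapi definitions):

```c
#include <fcntl.h>     /* fcntl() */
#include <stdint.h>

#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT (1024 + 12)   /* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT 2
#endif
#ifndef RWH_WRITE_LIFE_EXTREME
#define RWH_WRITE_LIFE_EXTREME 5
#endif

/* Tag an open fd with the expected lifetime of data written through
 * it. The hint is purely advisory; returns 0 on success. */
int set_write_hint(int fd, uint64_t hint)
{
    return fcntl(fd, F_SET_RW_HINT, &hint);
}
```

In the workload discussed here, a commit log that is truncated ahead and unlinked soon after might be tagged RWH_WRITE_LIFE_SHORT, while snapshot data that survives for months would be RWH_WRITE_LIFE_EXTREME.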
> >I'll need to have a think about how to apply this to the allocator
> >policy algorithms before going any further, but I suspect making use
> >of this hint interface will allow us to prevent interleaving of short
> >and long term data and so avoid the freespace fragmentation it is
> >causing here....

> IIUC, the problem (of having ENOSPC on a 10% used disk) is not
> fragmentation per se, it's the alignment requirement.

Which, as I've noted above, it isn't: alignment is a hint, not a
requirement.

> To take it to the
> extreme, a 1TB disk can only hold a million files if those files
> must be aligned to 1MB, even if everything is perfectly laid out.
> For sure fragmentation would have degraded performance sooner or
> later, but that's not as bad as that ENOSPC.

What it comes down to is that, having looked into it, I don't know
why that ENOSPC error occurred.

Alignment didn't cause it, because alignment was being dropped - that
just caused free space fragmentation. Extent size hints didn't
cause it, because the size hints were dropped - that just caused
freespace fragmentation. A lack of free space
didn't cause it, because there was heaps of free space in all
allocation groups.

But something tickled a corner case that triggered an allocation
failure that was interpreted as ENOSPC rather than retrying the
allocation. Until I can reproduce the ENOSPC allocation failure
(and I tried!) then it'll be a mystery as to what caused it.

> entire file. But I think that, given that the extent size is treated
> as a hint (or so I infer from the fact that we have <32MB extents),
> so should the alignment. Perhaps allocation with a hint should be
> performed in two passes, first trying to match size and alignment,
> and second relaxing both restrictions.

I think I already mentioned there were 5 separate attempts to
allocate, each failure reducing restrictions:

1. extent sized and contiguous to adjacent block in file
2. extent sized and aligned, at higher block in AG
3.
extent sized, not aligned, at higher block in AG
4. >= minimum length, not aligned, anywhere in AG >= target AG
5. minimum length, not aligned, in any AG

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk
2018-10-21 14:28 ` Dave Chinner
@ 2018-10-22 8:35 ` Avi Kivity
2018-10-22 9:52 ` Dave Chinner
0 siblings, 1 reply; 26+ messages in thread
From: Avi Kivity @ 2018-10-22 8:35 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs

On 21/10/2018 17.28, Dave Chinner wrote:
> On Sun, Oct 21, 2018 at 11:55:47AM +0300, Avi Kivity wrote:
>> On 19/10/2018 10.51, Dave Chinner wrote:
>>> On Thu, Oct 18, 2018 at 04:36:42PM +0300, Avi Kivity wrote:
>>>> On 18/10/2018 14.00, Avi Kivity wrote:
>>>>>> Can I get access to the metadump to dig around in the filesystem
>>>>>> directly so I can see how everything has ended up laid out? that
>>>>>> will help me work out what is actually occurring and determine if
>>>>>> mkfs/mount options can address the problem or whether deeper
>>>>>> allocator algorithm changes may be necessary....
>>>>> I will ask permission to share the dump.
>>>> I'll send you a link privately.
>>> Thanks - I've started looking at this - the information here is
>>> just layout stuff - I've omitted filenames and anything else that
>>> might be identifying from the output.
>>> >>> Looking at a commit log file: >>> >>> stat.size = 33554432 >>> stat.blocks = 34720 >>> fsxattr.xflags = 0x800 [----------e-----] >>> fsxattr.projid = 0 >>> fsxattr.extsize = 33554432 >>> fsxattr.cowextsize = 0 >>> fsxattr.nextents = 14 >>> >>> >>> and the layout: >>> >>> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS >>> 0: [0..4079]: 2646677520..2646681599 22 (95606800..95610879) 4080 001010 >>> 1: [4080..8159]: 2643130384..2643134463 22 (92059664..92063743) 4080 001010 >>> 2: [8160..12239]: 2642124816..2642128895 22 (91054096..91058175) 4080 001010 >>> 3: [12240..16319]: 2640666640..2640670719 22 (89595920..89599999) 4080 001010 >>> 4: [16320..18367]: 2640523264..2640525311 22 (89452544..89454591) 2048 000000 >>> 5: [18368..20415]: 2640119808..2640121855 22 (89049088..89051135) 2048 000000 >>> 6: [20416..21287]: 2639874064..2639874935 22 (88803344..88804215) 872 001111 >>> 7: [21288..21295]: 2639874936..2639874943 22 (88804216..88804223) 8 011111 >>> 8: [21296..24495]: 2639874944..2639878143 22 (88804224..88807423) 3200 001010 >>> 9: [24496..26543]: 2639427584..2639429631 22 (88356864..88358911) 2048 000000 >>> 10: [26544..28591]: 2638981120..2638983167 22 (87910400..87912447) 2048 000000 >>> 11: [28592..30639]: 2638770176..2638772223 22 (87699456..87701503) 2048 000000 >>> 12: [30640..31279]: 2638247952..2638248591 22 (87177232..87177871) 640 001111 >>> 13: [31280..34719]: 2638248592..2638252031 22 (87177872..87181311) 3440 011010 >>> 14: [34720..65535]: hole 30816 >>> >>> The first thing I note is the initial allocations are just short of >>> 2MB and so the extent size hint is, indeed, being truncated here >>> according to contiguous free space limitations. I had thought that >>> should occur from reading the code, but it's complex and I wasn't >>> 100% certain what minimum allocation length would be used. 
>>> >>> Looking at the system batchlog files, I'm guessing the filesystem >>> ran out of contiguous 32MB free space extents some time around >>> September 25. The *Data.db files from 24 Sep and earlier then are >>> all nice 32MB extents, from 25 sep onwards they never make the full >>> 32MB (30-31MB max). eg, good: >>> >>> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS >>> 0: [0..65535]: 350524552..350590087 3 (2651272..2716807) 65536 001111 >>> 1: [65536..131071]: 353378024..353443559 3 (5504744..5570279) 65536 001111 >>> 2: [131072..196607]: 355147016..355212551 3 (7273736..7339271) 65536 001111 >>> 3: [196608..262143]: 360029416..360094951 3 (12156136..12221671) 65536 001111 >>> 4: [262144..327679]: 362244144..362309679 3 (14370864..14436399) 65536 001111 >>> 5: [327680..343415]: 365809456..365825191 3 (17936176..17951911) 15736 001111 >>> >>> bad: >>> >>> EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS >>> 0: [0..64127]: 512855496..512919623 4 (49024456..49088583) 64128 001111 >>> 1: [64128..128247]: 266567048..266631167 2 (34651528..34715647) 64120 001010 >>> 2: [128248..142327]: 264401888..264415967 2 (32486368..32500447) 14080 001111 >> >> So extent size is a hint but the extent alignment is a hard >> requirement. > No, physical alignment is being ignored here, too. THose flags on > the end? > > FLAG Values: > 0100000 Shared extent > 0010000 Unwritten preallocated extent > 0001000 Doesn't begin on stripe unit > 0000100 Doesn't end on stripe unit > 0000010 Doesn't begin on stripe width > 0000001 Doesn't end on stripe width > > When you have 001111, the allocation was completely unaligned. > When you have 001010, the tail is stripe aligned > When you ahve 000000, the head and tail are stripe aligned > > As you can see, there is a mix of aligned, tail aligned and > completely unaligned extents. > > So, no, XFS is droping both size hints and alignment hints when > it starts running out of aligned contiguous free space extents. 
You are right; I searched for and found some files that were
head-aligned, and jumped to conclusions, but there are many that are
not. Those head-aligned files probably belonged to an era in that
filesystem's life when head-aligned extents less than 1MB were still
available.

>>> Ok, so the result is not perfect, but there are now huge contiguous
>>> free space extents available again - ~70% of the free space is now
>>> contiguous extents >=32MB in length. There's every chance that the
>>> fs would continue to help reform large contiguous free spaces as the
>>> database files come and go now, as long as the snapshot problem is
>>> dealt with.
>>>
>>> So, what's the problem? Well, it's simply that the workload is
>>> mixing data with vastly different temporal characteristics in the
>>> same physical locality. Every half an hour, a set of ~100 smallish
>>> files are written into a new directory, which lands them at the low
>>> end of the largest free space extent in that AG. Each new snapshot
>>> directory ends up in a different AG, so it slowly spreads the
>>> snapshots across all the AGs in the filesystem.
>>
>> Not exactly - those snapshots are hard links into the live database
>> files, which eventually get removed. Usually, small files get
>> removed early, but with the snapshots they get to live forever.
>
> They might be created as hard links, but the effect when the
> original database file links are removed is the same - the
> snapshotted data lives forever, interleaved amongst short term data.

Yes.

>>> Each snapshot effectively appends to the current working area in the
>>> AG, chopping it out of the largest contiguous free space. By the
>>> time the next snapshot in that AG comes around, there's other new
>>> short term data between the old snapshot and the new one. The new
>>> snapshot chops up the largest freespace, and on goes the cycle.
>>> >>> Eventually the short term data between the snapshots gets removed, >>> but this doesn't reform large contiguous free spaces because the >>> snapshot data is in the way. And so this cycle continues with the >>> snapshot data chopping up the largest freespace extents in the >>> filesystem until there's not more large free space extents to be >>> found. >>> >>> The solution is to manage the snapshot data better. We need to keep >>> all the long term data physically isolated from the short term data >>> so they don't fragment free space. A short term application level >>> solution would require migrating the snapshot data out of the >>> filesystem to somewhere else and point to it with symlinks. >> >> Snapshots should not live forever on the disk. The procedure is to >> create a snapshot, copy it away, and then delete the snapshot. It's >> okay to let snapshots live for a while, but not all of them and not >> without a bound on their lifetime. >> >> >> The filesystem did have a role in this, by requiring alignment of >> the extent to the RAID stripe size. > No, in the end it didn't. Right. > >> Now, given that this was a RAID >> with one member, alignment is pointless, but most of our deployments >> are to RAID arrays with >1 members, and alignment does save 12.5% of >> IOPS compared to un-aligned extents for compactions and writes (our >> scans/writes use 128k buffers, and the alignment is to 1MB). The >> database caused the problem by indirectly requiring 1MB alignment >> for files that are much smaller than 1MB, and the user contributed >> to the problem by causing millions of such small files to be kept. > *nod* > >>> <ding> >>> >>> Hold on.... >>> >>> <rummage in code> >>> >>> ....we already have an interface so setting those sorts of hints. >>> >>> fcntl(F_SET_RW_HINT, rw_hint) >>> >>> /* >>> * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be >>> * used to clear any hints previously set. 
>>> */ >>> #define RWF_WRITE_LIFE_NOT_SET 0 >>> #define RWH_WRITE_LIFE_NONE 1 >>> #define RWH_WRITE_LIFE_SHORT 2 >>> #define RWH_WRITE_LIFE_MEDIUM 3 >>> #define RWH_WRITE_LIFE_LONG 4 >>> #define RWH_WRITE_LIFE_EXTREME 5 >>> >>> Avi, does this sound like something that you could use to >>> classify the different types of data the data base writes out? >> >> So long as the penalty for a mis-classification is not too large, we >> can for sure. > OK. > >>> I'll need to have a think about how to apply this to the allocator >>> policy algorithms before going any further, but I suspect making use >>> of this hint interface will allow us prevent interleaving of short >>> and long term data so avoid the freespace fragmentation it is >>> causing here.... >> >> IIUC, the problem (of having ENOSPC on a 10% used disk) is not >> fragmentation per se, it's the alignment requirement. > Which, as I've noted above, alignment is a hint, not a requirement. > >> To take it to >> extreme, a 1TB disk can only hold a million files if those files >> must be aligned to 1MB, even if everything is perfectly laid out. >> For sure fragmentation would have degraded performance sooner or >> later, but that's not as bad as that ENOSPC. > What it comes down to is that having looked into it, I don't know > why that ENOSPC error occurred. > > Alignment didn't cause it because alignment was being dropped - that > just caused free space fragmentation. Extent size hints didn't > cause it because the size hints were dropped - that just caused > freespace fragmentation. A lack of free space > didn't cause it, because there was heaps of free space in all > allocation groups. > > But something tickled a corner case that triggered an allocation > failure that was interpretted as ENOSPC rather than retrying the > allocation. Until I can reproduce the ENOSPC allocation failure > (and I tried!) then it'll be a mystery as to what caused it. 
The user reported the error happening multiple times, taking many
hours to reproduce, but on more than one node. So it's an obscure
corner case, but not obscure enough to be a one-off event.

I've asked the user to regularly trim their snapshots (they were not
aware of the snapshots, actually - they were performed as a side
effect of a TRUNCATE operation), and we'll remove the default extent
hint for small files. I'll also consider noalign - the 12.5%
reduction in IOPS is perhaps not worth the fragmentation it
generates.

>> entire file. But I think that, given that the extent size is treated
>> as a hint (or so I infer from the fact that we have <32MB extents),
>> so should the alignment. Perhaps allocation with a hint should be
>> performed in two passes, first trying to match size and alignment,
>> and second relaxing both restrictions.
> I think I already mentioned there were 5 separate attempts to
> allocate, each failure reducing restrictions:
>
> 1. extent sized and contiguous to adjacent block in file
> 2. extent sized and aligned, at higher block in AG
> 3. extent sized, not aligned, at higher block in AG
> 4. >= minimum length, not aligned, anywhere in AG >= target AG

Surprised at this one. Won't it skew usage in high AGs? Perhaps it's
rare enough not to matter. Perhaps those higher-block/higher-AG
heuristics can be improved for non-rotational media.

> 5. minimum length, not aligned, in any AG

Thanks for your patience in helping me understand this issue.

Avi

> Cheers,
>
> Dave.

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk
2018-10-22 8:35 ` Avi Kivity
@ 2018-10-22 9:52 ` Dave Chinner
0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2018-10-22 9:52 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs

On Mon, Oct 22, 2018 at 11:35:26AM +0300, Avi Kivity wrote:
>
> On 21/10/2018 17.28, Dave Chinner wrote:
> >On Sun, Oct 21, 2018 at 11:55:47AM +0300, Avi Kivity wrote:
> >>For sure fragmentation would have degraded performance sooner or
> >>later, but that's not as bad as that ENOSPC.
> >What it comes down to is that having looked into it, I don't know
> >why that ENOSPC error occurred.
> >
> >Alignment didn't cause it because alignment was being dropped - that
> >just caused free space fragmentation. Extent size hints didn't
> >cause it because the size hints were dropped - that just caused
> >freespace fragmentation. A lack of free space
> >didn't cause it, because there was heaps of free space in all
> >allocation groups.
> >
> >But something tickled a corner case that triggered an allocation
> >failure that was interpreted as ENOSPC rather than retrying the
> >allocation. Until I can reproduce the ENOSPC allocation failure
> >(and I tried!) then it'll be a mystery as to what caused it.
>
> The user reported the error happening multiple times, taking many
> hours to reproduce, but on more than one node. So it's an obscure
> corner case but not obscure enough to be a one-off event.

Yeah, as with all these sorts of things, the difficulty is in
reproducing it. I'll have a look through some of the higher level
code during the week to see if there's a min/max len condition I
missed somewhere that might lead to failure instead of a retry. It
shouldn't really fail at all, because in the end a single block
allocation is allowable for normal extent size w/ alignment
allocation and there are heaps of free space available.

> >>entire file.
> >>But I think that, given that the extent size is treated
> >>as a hint (or so I infer from the fact that we have <32MB extents),
> >>so should the alignment. Perhaps allocation with a hint should be
> >>performed in two passes, first trying to match size and alignment,
> >>and second relaxing both restrictions.
> >I think I already mentioned there were 5 separate attempts to
> >allocate, each failure reducing restrictions:
> >
> >1. extent sized and contiguous to adjacent block in file
> >2. extent sized and aligned, at higher block in AG
> >3. extent sized, not aligned, at higher block in AG
> >4. >= minimum length, not aligned, anywhere in AG >= target AG
>
> Surprised at this one. Won't it skew usage in high AGs?

It's a constraint based on AG locking order. We always lock in
ascending AG order, so if we've locked AG 4 and modified the free
list in preparation for allocation, then failed to find an aligned
extent, that AG will remain locked until we finish the allocation
process, and hence we can't lock AGs <= AG 4, otherwise we risk
deadlocking the allocator.....

> Perhaps it's rare enough not to matter.

It tends to be rare because we choose the AG ahead of time to ensure
that the majority of the time there is space available.

> Thanks for your patience in helping me understand this issue.

No worries, what I'm here for :)

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk 2018-10-18 11:00 ` Avi Kivity 2018-10-18 13:36 ` Avi Kivity @ 2018-10-18 15:44 ` Avi Kivity 2018-10-18 16:11 ` Avi Kivity 2018-10-19 1:24 ` Dave Chinner 2018-10-19 1:15 ` Dave Chinner 2 siblings, 2 replies; 26+ messages in thread From: Avi Kivity @ 2018-10-18 15:44 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On 18/10/2018 14.00, Avi Kivity wrote: > > > This can happen, and indeed I see our default hint is 1MB, so our > small files use a 1MB hint. Looks like we should remove that 1MB hint > since it's reducing allocation flexibility for XFS without a good return. I convinced myself that this is the root cause, it fits perfectly with your explanation. I still think that XFS should allocate *something* rather than ENOSPC, but I can also understand someone wanting a guarantee. > On the other hand, I worry that because we bypass the page cache, XFS > doesn't get to see the entire file at one time and so it will get > fragmented. That's what happens. I write 1000 4k writes to 400 files, in parallel, AIO+DIO. I got 400 perfectly-fragmented files, each had 1000 extents. So I'll remove the default hint for small files, and replace it with larger buffer sizes so we batch more and don't get 8k-sized extents (which is our default buffer size). > > > Suppose I write a 4k file with a 1MB hint. How is that trailing > (1MB-4k) marked? Free extent, free extent with extra annotation, or > allocated extent? We may need to deallocate those extents? (will > FALLOC_FL_PUNCH_HOLE do the trick?) > I found an 11-year-old post from you that says those reservations are freed on close: https://linux-xfs.oss.sgi.narkive.com/Bpctu4DN/reducing-memory-requirements-for-high-extent-xfs-files#post6 This is consistent with xfs_db reporting those areas are free. ^ permalink raw reply [flat|nested] 26+ messages in thread
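As a sketch of the FALLOC_FL_PUNCH_HOLE approach asked about above (the helper name and the idea of punching the reserved tail between EOF and the end of the hint are illustrative, not something the thread confirms the application adopted):

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* fallocate(), FALLOC_FL_* */
#include <sys/types.h>

/* Free the blocks backing [offset, offset+len) without changing the
 * file size. Punching the range between the logical EOF and the end
 * of the extent size hint returns any reserved-but-unused space to
 * the free space pool; KEEP_SIZE is mandatory when punching. */
int punch_range(int fd, off_t offset, off_t len)
{
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len);
}
```

As noted later in the thread, XFS frees speculative preallocation beyond EOF on last close anyway, so explicit punching is only needed if the space must be reclaimed earlier.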
* Re: ENSOPC on a 10% used disk 2018-10-18 15:44 ` Avi Kivity @ 2018-10-18 16:11 ` Avi Kivity 2018-10-19 1:24 ` Dave Chinner 1 sibling, 0 replies; 26+ messages in thread From: Avi Kivity @ 2018-10-18 16:11 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On 18/10/2018 18.44, Avi Kivity wrote: > > On 18/10/2018 14.00, Avi Kivity wrote: >> >> >> This can happen, and indeed I see our default hint is 1MB, so our >> small files use a 1MB hint. Looks like we should remove that 1MB hint >> since it's reducing allocation flexibility for XFS without a good >> return. > > > I convinced myself that this is the root cause, it fits perfectly with > your explanation. I still think that XFS should allocate *something* > rather than ENOSPC, but I can also understand someone wanting a > guarantee. > A small twist: there were in fact lots of small files on that system, caused by snapshots that the user did not remove. But I think the explanation still holds. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk 2018-10-18 15:44 ` Avi Kivity 2018-10-18 16:11 ` Avi Kivity @ 2018-10-19 1:24 ` Dave Chinner 2018-10-21 9:00 ` Avi Kivity 1 sibling, 1 reply; 26+ messages in thread From: Dave Chinner @ 2018-10-19 1:24 UTC (permalink / raw) To: Avi Kivity; +Cc: linux-xfs On Thu, Oct 18, 2018 at 06:44:54PM +0300, Avi Kivity wrote: > > On 18/10/2018 14.00, Avi Kivity wrote: > > > > > >This can happen, and indeed I see our default hint is 1MB, so our > >small files use a 1MB hint. Looks like we should remove that 1MB > >hint since it's reducing allocation flexibility for XFS without a > >good return. > > > I convinced myself that this is the root cause, it fits perfectly > with your explanation. I still think that XFS should allocate > *something* rather than ENOSPC, but I can also understand someone > wanting a guarantee. Yup, it's a classic catch 22. > >On the other hand, I worry that because we bypass the page cache, > >XFS doesn't get to see the entire file at one time and so it will > >get fragmented. > > > That's what happens. I write 1000 4k writes to 400 files, in > parallel, AIO+DIO. I got 400 perfectly-fragmented files, each had > 1000 extents. Yup, you wrote them all in the one directory, didn't you? :) > So I'll remove the default hint for small files, and replace it with > larger buffer sizes so we batch more and don't get 8k-sized extents > (which is our default buffer size). Or you could just mount with the "noalign" mount option to turn off stripe alignment. After all, you don't need stripe alignment for a single spindle.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk 2018-10-19 1:24 ` Dave Chinner @ 2018-10-21 9:00 ` Avi Kivity 2018-10-21 14:34 ` Dave Chinner 0 siblings, 1 reply; 26+ messages in thread From: Avi Kivity @ 2018-10-21 9:00 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On 19/10/2018 04.24, Dave Chinner wrote: > On Thu, Oct 18, 2018 at 06:44:54PM +0300, Avi Kivity wrote: >> On 18/10/2018 14.00, Avi Kivity wrote: >>> >>> This can happen, and indeed I see our default hint is 1MB, so our >>> small files use a 1MB hint. Looks like we should remove that 1MB >>> hint since it's reducing allocation flexibility for XFS without a >>> good return. >> >> I convinced myself that this is the root cause, it fits perfectly >> with your explanation. I still think that XFS should allocate >> *something* rather than ENOSPC, but I can also understand someone >> wanting a guarantee. > Yup, it's a classic catch 22. > >>> On the other hand, I worry that because we bypass the page cache, >>> XFS doesn't get to see the entire file at one time and so it will >>> get fragmented. >> >> That's what happens. I write 1000 4k writes to 400 files, in >> parallel, AIO+DIO. I got 400 perfectly-fragmented files, each had >> 1000 extents. > Yup, you wrote them all in the one directory, didn't you? :) Yes :( But if I have more concurrently-written files than AGs, I'd get the same behavior with multiple directories, no? >> So I'll remove the default hint for small files, and replace it with >> larger buffer sizes so we batch more and don't get 8k-sized extents >> (which is our default buffer size). > Or you could just mount with the "noalign" mount option to turn off > stripe alignment. After all, you don't need stripe alignment for a > single spindle.... For a single spindle, sure. But most deployments have multiple spindles. Since these aren't real spindles, the advantages of alignment are not as great, but they still exist. 
The files are written with aligned offsets, and some of the reads are also aligned, so it saves IOPS whenever we cross an alignment boundary. ^ permalink raw reply [flat|nested] 26+ messages in thread
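For reference, the per-file extent size hint discussed throughout the thread is set via the fsxattr ioctl. A minimal sketch using the generic FS_IOC_FSSETXATTR interface (the same ioctl XFS exposes as XFS_IOC_FSSETXATTR; the helper name is mine):

```c
#include <sys/ioctl.h>
#include <linux/fs.h>   /* struct fsxattr, FS_IOC_FSGETXATTR/FSSETXATTR */

/* Set (or, with extsize_bytes == 0, clear) a per-file extent size
 * hint. The hint must be a multiple of the filesystem block size and
 * toggles the 'e' flag seen in fsxattr.xflags earlier in the thread. */
int set_extsize_hint(int fd, unsigned int extsize_bytes)
{
    struct fsxattr fsx;

    if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
        return -1;
    if (extsize_bytes)
        fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;    /* 0x800 */
    else
        fsx.fsx_xflags &= ~FS_XFLAG_EXTSIZE;
    fsx.fsx_extsize = extsize_bytes;
    return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}
```

Calling this with 0 is how an application would drop the default 1MB hint for its small files while keeping a large hint on the big, append-heavy ones.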
* Re: ENSOPC on a 10% used disk
2018-10-21 9:00 ` Avi Kivity
@ 2018-10-21 14:34 ` Dave Chinner
0 siblings, 0 replies; 26+ messages in thread
From: Dave Chinner @ 2018-10-21 14:34 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs

On Sun, Oct 21, 2018 at 12:00:16PM +0300, Avi Kivity wrote:
>
> On 19/10/2018 04.24, Dave Chinner wrote:
> >On Thu, Oct 18, 2018 at 06:44:54PM +0300, Avi Kivity wrote:
> >>On 18/10/2018 14.00, Avi Kivity wrote:
> >>>
> >>>This can happen, and indeed I see our default hint is 1MB, so our
> >>>small files use a 1MB hint. Looks like we should remove that 1MB
> >>>hint since it's reducing allocation flexibility for XFS without a
> >>>good return.
> >>
> >>I convinced myself that this is the root cause, it fits perfectly
> >>with your explanation. I still think that XFS should allocate
> >>*something* rather than ENOSPC, but I can also understand someone
> >>wanting a guarantee.
> >Yup, it's a classic catch 22.
> >
> >>>On the other hand, I worry that because we bypass the page cache,
> >>>XFS doesn't get to see the entire file at one time and so it will
> >>>get fragmented.
> >>
> >>That's what happens. I write 1000 4k writes to 400 files, in
> >>parallel, AIO+DIO. I got 400 perfectly-fragmented files, each had
> >>1000 extents.
> >Yup, you wrote them all in the one directory, didn't you? :)
>
> Yes :(
>
> But if I have more concurrently-written files than AGs, I'd get the
> same behavior with multiple directories, no?

Up to a point. At which point, I'd say you're doing it wrong and
tell you to use extent size hints or buffered IO so the filesystem
can turn the small random writes into nicely formed large IOs via
delayed allocation. :)

Remember the first rule of storage: Garbage In, Garbage Out.

With direct IO, it's the responsibility of the application to give
the filesystem and storage layers well-formed IOs. If the app
doesn't play nice, there's nothing the filesystem or storage layers
can do to make it better....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk
2018-10-18 11:00 ` Avi Kivity
2018-10-18 13:36 ` Avi Kivity
2018-10-18 15:44 ` Avi Kivity
@ 2018-10-19 1:15 ` Dave Chinner
2018-10-21 9:21 ` Avi Kivity
2 siblings, 1 reply; 26+ messages in thread
From: Dave Chinner @ 2018-10-19 1:15 UTC (permalink / raw)
To: Avi Kivity; +Cc: linux-xfs

On Thu, Oct 18, 2018 at 02:00:19PM +0300, Avi Kivity wrote:
> On 18/10/2018 13.05, Dave Chinner wrote:
> >On Thu, Oct 18, 2018 at 10:55:18AM +0300, Avi Kivity wrote:
> >>On 18/10/2018 04.37, Dave Chinner wrote:
> >>>On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
> >>>>I have a user running a 1.7TB filesystem with ~10% usage (as shown
> >>>>by df), getting sporadic ENOSPC errors. The disk is mounted with
> >>>>inode64 and has a relatively small number of large files. The disk
> >>>>is a single-member RAID0 array, with 1MB chunk size. There are 32

> >Ok, now I need to know what "single member RAID0 array" means,
> >because this is clearly related to allocation alignment and I need
> >to know why the FS was configured the way it was.
>
> It's a Linux RAID device, /dev/md0.
>
> We configure it this way so that it's easy to add storage (okay, the
> real reason is probably to avoid special casing one drive).

As a stripe? That requires resilvering to expand, which is a slow,
messy operation. There have also been too many horror stories about
crashes during resilvering causing unrecoverable corruptions for my
liking...

> One disk, organized into a Linux RAID device with just one member.

So there's no real need for IO alignment at all. Unaligned writes to
RAID0 don't require RMW cycles, so alignment is really only used to
avoid hotspotting a disk in the stripe. Which isn't an issue here,
either.
> >>meta-data=/dev/loop2 isize=512 agcount=32, agsize=14494720 blks > >> = sectsz=512 attr=2, projid32bit=1 > >> = crc=1 finobt=0 spinodes=0 rmapbt=0 > >> = reflink=0 > >>data = bsize=4096 blocks=463831040, imaxpct=5 > >> = sunit=256 swidth=256 blks > >sunit=swidth is unusual for a RAID0 array, unless it's hardware RAID > >and the array only reports one number to mkfs. Was this chosen by > >mkfs, or specifically configured by the user? If specifically > >configured, why? > > > I'm guessing it's because it has one member? I'm guessing the usual > is swidth=sunit*nmembers? *nod*. Which is unusual for a RAID0 device. > >What is important is that it means aligned allocations will be used > >for any allocation that is over sunit (1MB) and that's where all the > >problems seem to come from. > > Do these aligned allocations not fall back to non-aligned > allocations if they fail? They do, but extent size hints change the fallback behaviour... > >See how we lost a large aligned 2MB freespace @ 9 when the small > >file "nn" was laid down? repeat this fill and free pattern over and > >over again, and eventually it fragments the free space until there's > >no large contiguous free spaces left, and large aligned extents can > >no longer be allocated. > > > >For this to trigger you need the small files to be larger than 1 > >stripe unit, but still much smaller than the extent size hint, and > >the small files need to hang around as the large files come and go. > > > This can happen, and indeed I see our default hint is 1MB, so our > small files use a 1MB hint. Ok, which forces all allocations to be at least stripe unit (1MB) aligned. > > Looks like we should remove that 1MB > hint since it's reducing allocation flexibility for XFS without a > good return. On the other hand, I worry that because we bypass the > page cache, XFS doesn't get to see the entire file at one time and > so it will get fragmented. Yes. 
Your other option is to use an extent size hint that is smaller than the sunit. That should not align to 1MB because the initial data allocation size is not large enough to trigger stripe alignment. > Suppose I write a 4k file with a 1MB hint. How is that trailing > (1MB-4k) marked? Free extent, free extent with extra annotation, or > allocated extent? We may need to deallocate those extents? (will > FALLOC_FL_PUNCH_HOLE do the trick?) It's an unwritten extent beyond EOF, and how that is treated when the file is last closed depends on how that extent was allocated. But, yes, punching the range beyond EOF will definitely free it. > >>>>Is this a known issue? > >The effect and symptom is - it's a generic large aligned extent vs small unaligned extent > >issue, but I've never seen it manifest in a user workload outside of > >a very constrained multistream realtime video ingest/playout > >workload (i.e. the workload the filestreams allocator was written > >for). And before you ask, no, the filestreams allocator does not > >solve this problem. > > > >The most common manifestation of this problem has been inode > >allocation on filesystems full of small files - inodes are allocated > >in large aligned extents compared to small files, and so eventually > >the filesystem runs out of large contiguous freespace and inodes > >can't be allocated. The sparse inodes mkfs option fixed this by > >allowing inodes to be allocated as sparse chunks so they could > >interleave into any free space available.... > > Shouldn't XFS fall back to a non-aligned allocation rather than > returning ENOSPC on a filesystem with 90% free space? The filesystem does fall back to unaligned allocation - there are ~5 separate, progressively less strict allocation attempts on failure. The problem is that the extent size hint is asking to allocate a contiguous 32MB extent and there's no contiguous 32MB free space extent available, aligned or not. 
That's what I think is generating the ENOSPC error, but it's not clear to me from the code whether it is supposed to ignore the extent size hint on failure and allocate a set of shorter unaligned extents or not.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 26+ messages in thread
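The fill-and-free pattern described in this exchange is easy to model. Below is a toy first-fit simulation, not XFS's real allocator, with made-up parameters (4k blocks, a 256-block stripe unit, a 32MB hint, 300-block "small" files); it illustrates how persistent small aligned files can erode the supply of large contiguous free extents:

```python
SUNIT = 256       # stripe unit: 1MB in 4k blocks
HINT = 8192       # 32MB extent size hint, in 4k blocks
SMALL = 300       # a "small" file: larger than one sunit, much smaller than the hint
DISK = 1 << 20    # 4GB toy disk, in 4k blocks

free = [(0, DISK)]  # sorted, coalesced (start, length) free extents

def alloc(length, align=1):
    """First-fit: carve `length` blocks starting at an `align`-aligned offset."""
    for i, (s, l) in enumerate(free):
        a = -(-s // align) * align            # round start up to alignment
        if a + length <= s + l:
            repl = [(s, a - s)] if a > s else []
            if s + l > a + length:
                repl.append((a + length, (s + l) - (a + length)))
            free[i:i + 1] = repl
            return a
    return None                               # "ENOSPC" for this request

def release(start, length):
    """Return an extent to the free list and coalesce neighbours."""
    free.append((start, length))
    free.sort()
    merged = [free[0]]
    for s, l in free[1:]:
        ps, pl = merged[-1]
        if ps + pl == s:
            merged[-1] = (ps, pl + l)
        else:
            merged.append((s, l))
    free[:] = merged

for generation in range(50):
    bigs = []
    while True:                  # fill: small files interleaved with commitlogs
        alloc(SMALL, SUNIT)      # small file; its 1MB hint forces stripe alignment
        b = alloc(HINT, SUNIT)   # 32MB commitlog, also stripe aligned
        if b is None:
            break
        bigs.append(b)
    for b in bigs:               # the commitlogs get unlinked...
        release(b, HINT)         # ...but the small files stay behind

used = DISK - sum(l for _, l in free)
largest = max((l for _, l in free), default=0)
print(f"used ~{100 * used // DISK}%, largest free extent: {largest} blocks "
      f"(need {HINT} contiguous for a hinted 32MB allocation)")
```

Depending on the parameters, this toy model can end up with low utilization yet no free extent large enough for a hinted 32MB allocation, which is the shape of the reported failure.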
* Re: ENSOPC on a 10% used disk 2018-10-19 1:15 ` Dave Chinner @ 2018-10-21 9:21 ` Avi Kivity 2018-10-21 15:06 ` Dave Chinner 0 siblings, 1 reply; 26+ messages in thread From: Avi Kivity @ 2018-10-21 9:21 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On 19/10/2018 04.15, Dave Chinner wrote: > On Thu, Oct 18, 2018 at 02:00:19PM +0300, Avi Kivity wrote: >> On 18/10/2018 13.05, Dave Chinner wrote: >>> On Thu, Oct 18, 2018 at 10:55:18AM +0300, Avi Kivity wrote: >>>> On 18/10/2018 04.37, Dave Chinner wrote: >>>>> On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote: >>>>>> I have a user running a 1.7TB filesystem with ~10% usage (as shown >>>>>> by df), getting sporadic ENOSPC errors. The disk is mounted with >>>>>> inode64 and has a relatively small number of large files. The disk >>>>>> is a single-member RAID0 array, with 1MB chunk size. There are 32 >>> Ok, now I need to know what "single member RAID0 array" means, >>> becuase this is clearly related to allocation alignment and I need >>> to know why the FS was configured the way it was. >> >> It's a Linux RAID device, /dev/md0. >> >> >> We configure it this way so that it's easy to add storage (okay, the >> real reason is probably to avoid special casing one drive). > As a stripe? That requires resilvering to expand, which is a slow, > messy operation. There's also been too many horror stories about > crashes during rsilvering causing unrecoverable corruptions for my > liking... Like I said, the real reason is to avoid a special case for one disk. I don't think we, or one of our users, ever expanded a RAID array in this way. > >> One disk, organized into a Linux RAID device with just one member. > So there's no realy need for IO alignment at all. Unaligned writes > to RAID0 don't require RMW cycles, so alignment is really onl used > to avoid hotspotting a disk in the stripe. Which isn't an issue > here, either. 
It does help (for >1 member arrays) in avoiding a logically aligned read or write from being split into two ops targeting two disks. >>>> meta-data=/dev/loop2 isize=512 agcount=32, agsize=14494720 blks >>>> = sectsz=512 attr=2, projid32bit=1 >>>> = crc=1 finobt=0 spinodes=0 rmapbt=0 >>>> = reflink=0 >>>> data = bsize=4096 blocks=463831040, imaxpct=5 >>>> = sunit=256 swidth=256 blks >>> sunit=swidth is unusual for a RAID0 array, unless it's hardware RAID >>> and the array only reports one number to mkfs. Was this chosen by >>> mkfs, or specifically configured by the user? If specifically >>> configured, why? >> >> I'm guessing it's because it has one member? I'm guessing the usual >> is swidth=sunit*nmembers? > *nod*. Which is unusual for a RAID0 device. > >>> What is important is that it means aligned allocations will be used >>> for any allocation that is over sunit (1MB) and that's where all the >>> problems seem to come from. >> Do these aligned allocations not fall back to non-aligned >> allocations if they fail? > They do, but extent size hints change the fallback behaviour... > >>> See how we lost a large aligned 2MB freespace @ 9 when the small >>> file "nn" was laid down? repeat this fill and free pattern over and >>> over again, and eventually it fragments the free space until there's >>> no large contiguous free spaces left, and large aligned extents can >>> no longer be allocated. >>> >>> For this to trigger you need the small files to be larger than 1 >>> stripe unit, but still much smaller than the extent size hint, and >>> the small files need to hang around as the large files come and go. >> >> This can happen, and indeed I see our default hint is 1MB, so our >> small files use a 1MB hint. > Ok, which forces all allocations to be at least stripe unit (1MB) > aligned. If the hint were smaller than the stripe unit, would it remove the alignment requirement? I see you answered below. 
>> Looks like we should remove that 1MB >> hint since it's reducing allocation flexibility for XFS without a >> good return. On the other hand, I worry that because we bypass the >> page cache, XFS doesn't get to see the entire file at one time and >> so it will get fragmented. > Yes. Your other option is to use an extent size hint that is smaller > than the sunit. That should not align to 1MB because the initial > data allocation size is not large enough to trigger stripe > alignment. Wow, so we had so many factors leading to this: - 1-disk installations arranged as RAID0 even though not strictly needed - having a default extent allocation hint, even for small files - having that default hint be >= the stripe unit size - the user not removing snapshots - XFS not falling back to unaligned allocations >> Suppose I write a 4k file with a 1MB hint. How is that trailing >> (1MB-4k) marked? Free extent, free extent with extra annotation, or >> allocated extent? We may need to deallocate those extents? (will >> FALLOC_FL_PUNCH_HOLE do the trick?) > It's an unwritten extent beyond EOF, and how that is treated when > the file is last closed depends on how that extent was allocated. > But, yes, punching the range beyond EOF will definitely free it. I think we can conclude from the dump that the filesystem freed it? >>>>>> Is this a known issue? >>> The effect and symptom is - it's a generic large aligned extent vs small unaligned extent >>> issue, but I've never seen it manifest in a user workload outside of >>> a very constrained multistream realtime video ingest/playout >>> workload (i.e. the workload the filestreams allocator was written >>> for). And before you ask, no, the filestreams allocator does not >>> solve this problem. 
>>> >>> The most common manifestation of this problem has been inode >>> allocation on filesystems full of small files - inodes are allocated >>> in large aligned extents compared to small files, and so eventually >>> the filesystem runs out of large contiguous freespace and inodes >>> can't be allocated. The sparse inodes mkfs option fixed this by >>> allowing inodes to be allocated as sparse chunks so they could >>> interleave into any free space available.... >> Shouldn't XFS fall back to a non-aligned allocation rather than >> returning ENOSPC on a filesystem with 90% free space? > The filesystem does fall back to unaligned allocation - there are ~5 > separate, progressively less strict allocation attempts on failure. > > The problem is that the extent size hint is asking to allocate a > contiguous 32MB extent and there's no contiguous 32MB free space > extent available, aligned or not. That's what I think is generating > the ENOSPC error, but it's not clear to me from the code whether it > is supposed to ignore the extent size hint on failure and allocate a > set of shorter unaligned extents or not.... Here's a file from the dump:

ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 1eb2: 3928e00.. 392acb2: 1eb3:
1: 1eb3.. 3cb2: 3c91200.. 3c92fff: 1e00: 392acb3:
2: 3cb3.. 57b2: 3454100.. 3455bff: 1b00: 3c93000:
3: 57b3.. 6fb2: 34ecd00.. 34ee4ff: 1800: 3455c00:
4: 6fb3.. 85fe: 3386a00.. 338804b: 164c: 34ee500:
5: 85ff.. 9c0b: 2c85c00.. 2c8720c: 160d: 338804c:
6: 9c0c.. b217: 3099900.. 309af0b: 160c: 2c8720d:
7: b218.. c823: 34fb300.. 34fc90b: 160c: 309af0c:
8: c824.. de2b: 315ef00.. 3160507: 1608: 34fc90c:
9: de2c.. f42f: 36adc00.. 36af203: 1604: 3160508:
10: f430.. 10a30: 2cf4400.. 2cf5a00: 1601: 36af204:
11: 10a31.. 12030: 2e03300.. 2e048ff: 1600: 2cf5a01:
12: 12031.. 13630: 2ff5200.. 2ff67ff: 1600: 2e04900:
13: 13631.. 14c30: 3199e00.. 319b3ff: 1600: 2ff6800:
14: 14c31.. 16230: 32ed500.. 32eeaff: 1600: 319b400:
15: 16231.. 17830: 34a0b00.. 34a20ff: 1600: 32eeb00:
16: 17831.. 18e30: 354e700.. 354fcff: 1600: 34a2100:
17: 18e31.. 1a430: 362c400.. 362d9ff: 1600: 354fd00:
18: 1a431.. 1ba1d: 3192b00.. 31940ec: 15ed: 362da00:
19: 1ba1e.. 1d05c: 4228500.. 4229b3e: 163f: 31940ed:
20: 1d05d.. 1e692: 3f6c900.. 3f6df35: 1636: 4229b3f:
21: 1e693.. 1fcc0: 37d4400.. 37d5a2d: 162e: 3f6df36:
22: 1fcc1.. 212e4: 43f9c00.. 43fb223: 1624: 37d5a2e:
23: 212e5.. 22905: 4003500.. 4004b20: 1621: 43fb224:
24: 22906.. 23803: 1fdb900.. 1fdc7fd: efe: 4004b21: last,eof

So, lengths are not always aligned, but physical_offset always is. So XFS relaxes the extent size hint but not alignment. It looks like XFS allocates one extent and moves on, not trying to allocate all the way to the 32MB hint size. If that were the case, we'd see logical_offset restore alignment every 32MB. ^ permalink raw reply [flat|nested] 26+ messages in thread
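The alignment pattern in the dump can be checked mechanically. A quick sketch (assuming the dump is filefrag output with offsets and lengths in 4k filesystem blocks, and that sunit is 256 blocks, i.e. 1MB):

```python
SUNIT = 256  # 1MB stripe unit in 4k filesystem blocks

# (physical_offset, length) pairs transcribed from the first extents above
extents = [
    (0x3928e00, 0x1eb3), (0x3c91200, 0x1e00), (0x3454100, 0x1b00),
    (0x34ecd00, 0x1800), (0x3386a00, 0x164c), (0x2c85c00, 0x160d),
]

starts_aligned = all(p % SUNIT == 0 for p, _ in extents)
lens_aligned = all(n % SUNIT == 0 for _, n in extents)
print(starts_aligned, lens_aligned)  # -> True False
```

That is, every physical start sits on a stripe-unit boundary while the lengths do not.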
* Re: ENSOPC on a 10% used disk 2018-10-21 9:21 ` Avi Kivity @ 2018-10-21 15:06 ` Dave Chinner 0 siblings, 0 replies; 26+ messages in thread From: Dave Chinner @ 2018-10-21 15:06 UTC (permalink / raw) To: Avi Kivity; +Cc: linux-xfs On Sun, Oct 21, 2018 at 12:21:33PM +0300, Avi Kivity wrote: > > On 19/10/2018 04.15, Dave Chinner wrote: > >On Thu, Oct 18, 2018 at 02:00:19PM +0300, Avi Kivity wrote: > >>On 18/10/2018 13.05, Dave Chinner wrote: > >>>On Thu, Oct 18, 2018 at 10:55:18AM +0300, Avi Kivity wrote: > >>>>On 18/10/2018 04.37, Dave Chinner wrote: > >>Looks like we should remove that 1MB > >>hint since it's reducing allocation flexibility for XFS without a > >>good return. On the other hand, I worry that because we bypass the > >>page cache, XFS doesn't get to see the entire file at one time and > >>so it will get fragmented. > >Yes. Your other option is to use an extent size hint that is smaller > >than the sunit. That should not align to 1MB because the initial > >data allocation size is not large enough to trigger stripe > >alignment. > > > Wow, so we had so many factors leading to this: > > - 1-disk installations arranged as RAID0 even though not strictly needed > > - having a default extent allocation hint, even for small files > > - having that default hint be >= the stripe unit size > > - the user not removing snapshots > > - XFS not falling back to unaligned allocations Everything but the last is true. XFS is definitely dropping the alignment hint once there are no more aligned contiguous free space extents. > >>Suppose I write a 4k file with a 1MB hint. How is that trailing > >>(1MB-4k) marked? Free extent, free extent with extra annotation, or > >>allocated extent? We may need to deallocate those extents? (will > >>FALLOC_FL_PUNCH_HOLE do the trick?) > >It's an unwritten extent beyond EOF, and how that is treated when > >the file is last closed depends on how that extent was allocated. > >But, yes, punching the range beyond EOF will definitely free it. 
> > I think we can conclude from the dump that the filesystem freed it? *nod* > ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 1eb2: 3928e00.. 392acb2: 1eb3:
> 1: 1eb3.. 3cb2: 3c91200.. 3c92fff: 1e00: 392acb3:
> 2: 3cb3.. 57b2: 3454100.. 3455bff: 1b00: 3c93000:
> 3: 57b3.. 6fb2: 34ecd00.. 34ee4ff: 1800: 3455c00:
> 4: 6fb3.. 85fe: 3386a00.. 338804b: 164c: 34ee500:
> 5: 85ff.. 9c0b: 2c85c00.. 2c8720c: 160d: 338804c:
> 6: 9c0c.. b217: 3099900.. 309af0b: 160c: 2c8720d:
> 7: b218.. c823: 34fb300.. 34fc90b: 160c: 309af0c:
> 8: c824.. de2b: 315ef00.. 3160507: 1608: 34fc90c:
> 9: de2c.. f42f: 36adc00.. 36af203: 1604: 3160508:
> 10: f430.. 10a30: 2cf4400.. 2cf5a00: 1601: 36af204:
> 11: 10a31.. 12030: 2e03300.. 2e048ff: 1600: 2cf5a01:
> 12: 12031.. 13630: 2ff5200.. 2ff67ff: 1600: 2e04900:
> 13: 13631.. 14c30: 3199e00.. 319b3ff: 1600: 2ff6800:
> 14: 14c31.. 16230: 32ed500.. 32eeaff: 1600: 319b400:
> 15: 16231.. 17830: 34a0b00.. 34a20ff: 1600: 32eeb00:
> 16: 17831.. 18e30: 354e700.. 354fcff: 1600: 34a2100:
> 17: 18e31.. 1a430: 362c400.. 362d9ff: 1600: 354fd00:
> 18: 1a431.. 1ba1d: 3192b00.. 31940ec: 15ed: 362da00:
> 19: 1ba1e.. 1d05c: 4228500.. 4229b3e: 163f: 31940ed:
> 20: 1d05d.. 1e692: 3f6c900.. 3f6df35: 1636: 4229b3f:
> 21: 1e693.. 1fcc0: 37d4400.. 37d5a2d: 162e: 3f6df36:
> 22: 1fcc1.. 212e4: 43f9c00.. 43fb223: 1624: 37d5a2e:
> 23: 212e5.. 22905: 4003500.. 4004b20: 1621: 43fb224:
> 24: 22906.. 23803: 1fdb900.. 1fdc7fd: efe: 4004b21: last,eof
filefrag? I find that utterly unreadable, and without the command line I don't know what the units are. Can you use 'xfs_bmap -vvp' so that all the units are known and it automatically calculates whether extents are aligned or not? > So, lengths are not always aligned, but physical_offset always is. > So XFS relaxes the extent size hint but not alignment. No, that is incorrect. Filesystems never do what people expect them to. i.e. 
what you see above is because the filesystem could not find large enough contiguous free spaces to align both the ends of the allocation. i.e. Freespace looks like:

+----FF+FFFFFF+FFFFFF+FFFF-+------+

Alloc aligned w/ min len and max len:

+----FF+FFFFFF+FFFFFF+FFFF-+------+
       +WANT-THIS-BIT_HERE-+

But the nearest target free space extent returns:

      fffffffffffffffffffff

So we trim the front:

        fffffffffffffffffff

if len < min len, fail (didn't happen)
if > max len, trim end (no trim, not long enough)

And so we end up allocating front aligned and short:

        +WANT-THIS-BIT_HER+

Leaving behind:

+----FF+------+------+-----+------+

That's why it looks like there are aligned extents remaining, even when there aren't. The allocation logic is horrifically complex - it has 20-something controlling parameters and a heap of logic, maths and fallback paths around them. Unless you're intimately familiar with the code, you're unlikely to infer the allocator decisions from an extent list.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 26+ messages in thread
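The trim-the-front behaviour Dave describes can be sketched in a few lines (a deliberate simplification of xfs_bmap_btalloc, which has many more controlling parameters):

```python
def trim_aligned(start, length, align, minlen, maxlen):
    """Allocate from one free extent: align the front, then apply min/max len.
    Returns (start, len) of the allocation, or None to fall back/fail."""
    astart = -(-start // align) * align     # round start up to alignment
    avail = (start + length) - astart
    if avail < minlen:
        return None                         # too short once aligned
    return astart, min(avail, maxlen)

# A hypothetical free extent consistent with extent 8 of the dump above
# (units: 4k blocks; align = 0x100 = 1MB; hinted maxlen = 0x2000 = 32MB):
a = trim_aligned(0x315ee80, 0x1688, 0x100, 0x100, 0x2000)
print(tuple(hex(x) for x in a))  # -> ('0x315ef00', '0x1608')
```

The free extent here is invented, but the result matches the dump's extent 8: an aligned physical start with an unaligned, shorter-than-hint length.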
* Re: ENSOPC on a 10% used disk 2018-10-17 7:52 ENSOPC on a 10% used disk Avi Kivity 2018-10-17 8:47 ` Christoph Hellwig 2018-10-18 1:37 ` Dave Chinner @ 2018-10-18 15:54 ` Eric Sandeen 2018-10-21 11:49 ` Avi Kivity 2019-02-05 21:48 ` Dave Chinner 3 siblings, 1 reply; 26+ messages in thread From: Eric Sandeen @ 2018-10-18 15:54 UTC (permalink / raw) To: Avi Kivity, linux-xfs On 10/17/18 2:52 AM, Avi Kivity wrote: > I have a user running a 1.7TB filesystem with ~10% usage (as shown by df), getting sporadic ENOSPC errors. The disk is mounted with inode64 and has a relatively small number of large files. The disk is a single-member RAID0 array, with 1MB chunk size. There are 32 AGs. Running Linux 4.9.17. > > > The write load consists of AIO/DIO writes, followed by unlinks of these files. The writes are non-size-changing (we truncate ahead) and we use XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of 32MB. The errors happen on commit logs, which have a target size of 32MB (but may exceed it a little). > > > The errors are sporadic and after restarting the workload they go away for a few hours to a few days, but then return. During one of the crashes I used xfs_db to look at fragmentation and saw that most AGs had free extents of size categories up to 128-255, but a few had more. I tried xfs_fsr but it did not help. > > > Is this a known issue? Would upgrading the kernel help? > > > I'll try to get a metadata dump next time this happens, and I'll be happy to supply more information. It sounds like you all figured this out, but I'll drop a reference to One Weird Trick to figure out just what function is returning a specific error value (the example below is EINVAL) First is my hack, what follows was Dave's refinement. We should get this into scripts/ some day. 
> # for FUNCTION in `grep "t xfs_" /proc/kallsyms | awk '{print $3}'`; do echo "r:ret_$FUNCTION $FUNCTION \$retval" >> /sys/kernel/debug/tracing/kprobe_events; done > > # for ENABLE in /sys/kernel/debug/tracing/events/kprobes/ret_xfs_*/enable; do echo 1 > $ENABLE; done > > run a test that fails: > > # dd if=/dev/zero of=newfile bs=513 oflag=direct > dd: writing `newfile': Invalid argument > > # for ENABLE in /sys/kernel/debug/tracing/events/kprobes/ret_xfs_*/enable; do echo 0 > $ENABLE; done > > # cat /sys/kernel/debug/tracing/trace > <snip> > <...>-63791 [000] d... 705435.568913: ret_xfs_vn_mknod: (xfs_vn_create+0x13/0x20 [xfs] <- xfs_vn_mknod) arg1=0 > <...>-63791 [000] d... 705435.568913: ret_xfs_vn_create: (vfs_create+0xdb/0x100 <- xfs_vn_create) arg1=0 > <...>-63791 [000] d... 705435.568918: ret_xfs_file_open: (do_dentry_open+0x24e/0x2e0 <- xfs_file_open) arg1=0 > <...>-63791 [000] d... 705435.568934: ret_xfs_file_dio_aio_write: (xfs_file_aio_write+0x147/0x150 [xfs] <- xfs_file_dio_aio_write) arg1=ffffffffffffffea > > Hey look, it's "-22" in hex! > > so it's possible, but bleah. 
Dave later refined that to: > #!/bin/bash > > TRACEDIR=/sys/kernel/debug/tracing > > grep -i 't xfs_' /proc/kallsyms | awk '{print $3}' | while read F; do > echo "r:ret_$F $F \$retval" >> $TRACEDIR/kprobe_events > done > > for E in $TRACEDIR/events/kprobes/ret_xfs_*/enable; do > echo 1 > $E > done; > > echo 'arg1 > 0xffffffffffffff00' > $TRACEDIR/events/kprobes/filter > > for T in $TRACEDIR/events/kprobes/ret_xfs_*/trigger; do > echo 'traceoff if arg1 > 0xffffffffffffff00' > $T > done > And that gives: > > # dd if=/dev/zero of=/mnt/scratch/newfile bs=513 oflag=direct > dd: error writing '/mnt/scratch/newfile': Invalid argument > 1+0 records in > 0+0 records out > 0 bytes (0 B) copied, 0.000259882 s, 0.0 kB/s > root@test4:~# cat /sys/kernel/debug/tracing/trace > # tracer: nop > # > # entries-in-buffer/entries-written: 1/1 #P:16 > # > # _-----=> irqs-off > # / _----=> need-resched > # | / _---=> hardirq/softirq > # || / _--=> preempt-depth > # ||| / delay > # TASK-PID CPU# |||| TIMESTAMP FUNCTION > # | | | |||| | | > <...>-8073 [006] d... 145740.460546: ret_xfs_file_dio_aio_write: (xfs_file_aio_write+0x170/0x180 <- xfs_file_dio_aio_write) arg1=0xffffffffffffffea > > Which is precisely the detection that XFS_ERROR would have given us. > Ok, so I guess we can now add whatever we need to that trigger... > > Basically, pass in the XFS function names you want to trace; this > sets up the events with whatever trigger behaviour you want, and > we're off to the races... ^ permalink raw reply [flat|nested] 26+ messages in thread
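The `$retval` values in those traces are two's-complement negative errnos, which get tedious to decode by eye. A small helper (hypothetical, not part of the scripts above) that maps the hex dumps to errno names:

```python
import errno

def decode_retval(hexval, bits=64):
    """Interpret a kretprobe $retval hex dump as a signed kernel return code."""
    v = int(hexval, 16)
    if v >= 1 << (bits - 1):
        v -= 1 << bits                        # two's complement
    return v, (errno.errorcode.get(-v) if v < 0 else None)

print(decode_retval("ffffffffffffffea"))      # -> (-22, 'EINVAL')
print(decode_retval("ffffffffffffffe4"))      # -> (-28, 'ENOSPC')
```

The `arg1 > 0xffffffffffffff00` filter in Dave's script is the same idea: it matches return values in the -1..-255 errno range.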
* Re: ENSOPC on a 10% used disk 2018-10-18 15:54 ` Eric Sandeen @ 2018-10-21 11:49 ` Avi Kivity 0 siblings, 0 replies; 26+ messages in thread From: Avi Kivity @ 2018-10-21 11:49 UTC (permalink / raw) To: Eric Sandeen, linux-xfs On 18/10/2018 18.54, Eric Sandeen wrote: > On 10/17/18 2:52 AM, Avi Kivity wrote: >> I have a user running a 1.7TB filesystem with ~10% usage (as shown by df), getting sporadic ENOSPC errors. The disk is mounted with inode64 and has a relatively small number of large files. The disk is a single-member RAID0 array, with 1MB chunk size. There are 32 AGs. Running Linux 4.9.17. >> >> >> The write load consists of AIO/DIO writes, followed by unlinks of these files. The writes are non-size-changing (we truncate ahead) and we use XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of 32MB. The errors happen on commit logs, which have a target size of 32MB (but may exceed it a little). >> >> >> The errors are sporadic and after restarting the workload they go away for a few hours to a few days, but then return. During one of the crashes I used xfs_db to look at fragmentation and saw that most AGs had free extents of size categories up to 128-255, but a few had more. I tried xfs_fsr but it did not help. >> >> >> Is this a known issue? Would upgrading the kernel help? >> >> >> I'll try to get a metadata dump next time this happens, and I'll be happy to supply more information. > It sounds like you all figured this out, but I'll drop a reference to > One Weird Trick to figure out just what function is returning a specific > error value (the example below is EINVAL) > > First is my hack, what follows was Dave's refinement. We should get this > into scripts/ some day. Cool, although to get noticed these days you have to put in bpf somewhere (and probably it can help with some kernel-side filtering - start logging as soon as you see the error, and hopefully you can recover the path from the returns). 
>> # for FUNCTION in `grep "t xfs_" /proc/kallsyms | awk '{print $3}'`; do echo "r:ret_$FUNCTION $FUNCTION \$retval" >> /sys/kernel/debug/tracing/kprobe_events; done >> >> # for ENABLE in /sys/kernel/debug/tracing/events/kprobes/ret_xfs_*/enable; do echo 1 > $ENABLE; done >> >> run a test that fails: >> >> # dd if=/dev/zero of=newfile bs=513 oflag=direct >> dd: writing `newfile': Invalid argument >> >> # for ENABLE in /sys/kernel/debug/tracing/events/kprobes/ret_xfs_*/enable; do echo 0 > $ENABLE; done >> >> # cat /sys/kernel/debug/tracing/trace >> <snip> >> <...>-63791 [000] d... 705435.568913: ret_xfs_vn_mknod: (xfs_vn_create+0x13/0x20 [xfs] <- xfs_vn_mknod) arg1=0 >> <...>-63791 [000] d... 705435.568913: ret_xfs_vn_create: (vfs_create+0xdb/0x100 <- xfs_vn_create) arg1=0 >> <...>-63791 [000] d... 705435.568918: ret_xfs_file_open: (do_dentry_open+0x24e/0x2e0 <- xfs_file_open) arg1=0 >> <...>-63791 [000] d... 705435.568934: ret_xfs_file_dio_aio_write: (xfs_file_aio_write+0x147/0x150 [xfs] <- xfs_file_dio_aio_write) arg1=ffffffffffffffea >> >> Hey look, it's "-22" in hex! >> >> so it's possible, but bleah. 
> Dave later refined that to: > >> #!/bin/bash >> >> TRACEDIR=/sys/kernel/debug/tracing >> >> grep -i 't xfs_' /proc/kallsyms | awk '{print $3}' | while read F; do >> echo "r:ret_$F $F \$retval" >> $TRACEDIR/kprobe_events >> done >> >> for E in $TRACEDIR/events/kprobes/ret_xfs_*/enable; do >> echo 1 > $E >> done; >> >> echo 'arg1 > 0xffffffffffffff00' > $TRACEDIR/events/kprobes/filter >> >> for T in $TRACEDIR/events/kprobes/ret_xfs_*/trigger; do >> echo 'traceoff if arg1 > 0xffffffffffffff00' > $T >> done > > >> And that gives: >> >> # dd if=/dev/zero of=/mnt/scratch/newfile bs=513 oflag=direct >> dd: error writing '/mnt/scratch/newfile': Invalid argument >> 1+0 records in >> 0+0 records out >> 0 bytes (0 B) copied, 0.000259882 s, 0.0 kB/s >> root@test4:~# cat /sys/kernel/debug/tracing/trace >> # tracer: nop >> # >> # entries-in-buffer/entries-written: 1/1 #P:16 >> # >> # _-----=> irqs-off >> # / _----=> need-resched >> # | / _---=> hardirq/softirq >> # || / _--=> preempt-depth >> # ||| / delay >> # TASK-PID CPU# |||| TIMESTAMP FUNCTION >> # | | | |||| | | >> <...>-8073 [006] d... 145740.460546: ret_xfs_file_dio_aio_write: (xfs_file_aio_write+0x170/0x180 <- xfs_file_dio_aio_write) arg1=0xffffffffffffffea >> >> Which is precisely the detection that XFS_ERROR would have given us. >> Ok, so I guess we can now add whatever we need to that trigger... >> >> Basically, pass in the XFS function names you want to trace; this >> sets up the events with whatever trigger behaviour you want, and >> we're off to the races... > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: ENSOPC on a 10% used disk 2018-10-17 7:52 ENSOPC on a 10% used disk Avi Kivity ` (2 preceding siblings ...) 2018-10-18 15:54 ` Eric Sandeen @ 2019-02-05 21:48 ` Dave Chinner 2019-02-07 10:51 ` Avi Kivity 3 siblings, 1 reply; 26+ messages in thread From: Dave Chinner @ 2019-02-05 21:48 UTC (permalink / raw) To: Avi Kivity; +Cc: linux-xfs Hi Avi, On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote: > I have a user running a 1.7TB filesystem with ~10% usage (as shown > by df), getting sporadic ENOSPC errors. The disk is mounted with > inode64 and has a relatively small number of large files. The disk > is a single-member RAID0 array, with 1MB chunk size. There are 32 > AGs. Running Linux 4.9.17. > > > The write load consists of AIO/DIO writes, followed by unlinks of > these files. The writes are non-size-changing (we truncate ahead) > and we use XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of > 32MB. The errors happen on commit logs, which have a target size of > 32MB (but may exceed it a little). > > > The errors are sporadic and after restarting the workload they go > away for a few hours to a few days, but then return. During one of > the crashes I used xfs_db to look at fragmentation and saw that most > AGs had free extents of size categories up to 128-255, but a few had > more. I tried xfs_fsr but it did not help. > > > Is this a known issue? Would upgrading the kernel help? Long time, I know, but Brian has just made me aware of this commit from early 2018 that went into 4.16 that might be relevant and so I thought it best to close the loop: commit 6d8a45ce29c7d67cc4fc3016dc2a07660c62482a Author: Darrick J. Wong <darrick.wong@oracle.com> Date: Fri Jan 19 17:47:36 2018 -0800 xfs: don't screw up direct writes when freesp is fragmented xfs_bmap_btalloc is given a range of file offset blocks that must be allocated to some data/attr/cow fork. 
If the fork has an extent size hint associated with it, the request will be enlarged on both ends to try to satisfy the alignment hint. If free space is fragmentated, sometimes we can allocate some blocks but not enough to fulfill any of the requested range. Since bmapi_allocate always trims the new extent mapping to match the originally requested range, this results in bmapi_write returning zero and no mapping. The consequences of this vary -- buffered writes will simply re-call bmapi_write until it can satisfy at least one block from the original request. Direct IO overwrites notice nmaps == 0 and return -ENOSPC through the dio mechanism out to userspace with the weird result that writes fail even when we have enough space because the ENOSPC return overrides any partial write status. For direct CoW writes the situation was disastrous because nobody notices us returning an invalid zero-length wrong-offset mapping to iomap and the write goes off into space. Therefore, if free space is so fragmented that we managed to allocate some space but not enough to map into even a single block of the original allocation request range, we should break the alignment hint in order to guarantee at least some forward progress for the direct write. If we return a short allocation to iomap_apply it'll call back about the remaining blocks. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> The spurious ENOSPC symptoms seem to match what you are seeing here on your customer's 4.9 kernel, so it may be that this is the fix for the ENOSPC problem that was reported. If this comes up again, then perhaps it would be worth either upgrading the kernel to 4.16+ or backporting this commit to see if it fixes the problem. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 26+ messages in thread
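The failure mode the commit describes, an allocation that succeeds but lands entirely outside the originally requested range, can be sketched like this (a toy model of the bmapi trim step, not the kernel code):

```python
def trim_to_request(req_start, req_len, alloc_start, alloc_len):
    """bmapi_allocate-style trim of a hint-enlarged allocation back to the
    originally requested range; an empty intersection models nmaps == 0."""
    s = max(req_start, alloc_start)
    e = min(req_start + req_len, alloc_start + alloc_len)
    return (s, e - s) if e > s else None

# The hint enlarges a request for blocks [100, 108) to [96, 128); fragmented
# free space yields only [96, 100), none of it inside the requested range:
print(trim_to_request(100, 8, 96, 4))    # -> None (pre-4.16: spurious ENOSPC)
print(trim_to_request(100, 8, 96, 16))   # -> (100, 8)
```

In the first case some blocks were allocated yet nothing maps into the request, which is the zero-length mapping the commit fixes by dropping the alignment hint and retrying.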
* Re: ENSOPC on a 10% used disk 2019-02-05 21:48 ` Dave Chinner @ 2019-02-07 10:51 ` Avi Kivity 0 siblings, 0 replies; 26+ messages in thread From: Avi Kivity @ 2019-02-07 10:51 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On 05/02/2019 23.48, Dave Chinner wrote: > Hi Avi, > > On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote: >> I have a user running a 1.7TB filesystem with ~10% usage (as shown >> by df), getting sporadic ENOSPC errors. The disk is mounted with >> inode64 and has a relatively small number of large files. The disk >> is a single-member RAID0 array, with 1MB chunk size. There are 32 >> AGs. Running Linux 4.9.17. >> >> >> The write load consists of AIO/DIO writes, followed by unlinks of >> these files. The writes are non-size-changing (we truncate ahead) >> and we use XFS_IOC_FSSETXATTR/XFS_FLAG_EXTSIZE with a hint size of >> 32MB. The errors happen on commit logs, which have a target size of >> 32MB (but may exceed it a little). >> >> >> The errors are sporadic and after restarting the workload they go >> away for a few hours to a few days, but then return. During one of >> the crashes I used xfs_db to look at fragmentation and saw that most >> AGs had free extents of size categories up to 128-255, but a few had >> more. I tried xfs_fsr but it did not help. >> >> >> Is this a known issue? Would upgrading the kernel help? > Long time, I know, but Brian has just made me aware of this commit > from early 2018 that went into 4.16 that might be relevant and so I > thought it best to close the loop: > > commit 6d8a45ce29c7d67cc4fc3016dc2a07660c62482a > Author: Darrick J. Wong <darrick.wong@oracle.com> > Date: Fri Jan 19 17:47:36 2018 -0800 > > xfs: don't screw up direct writes when freesp is fragmented > > xfs_bmap_btalloc is given a range of file offset blocks that must be > allocated to some data/attr/cow fork. 
> If the fork has an extent size
> hint associated with it, the request will be enlarged on both ends to
> try to satisfy the alignment hint. If free space is fragmented,
> sometimes we can allocate some blocks but not enough to fulfill any of
> the requested range. Since bmapi_allocate always trims the new extent
> mapping to match the originally requested range, this results in
> bmapi_write returning zero and no mapping.
>
> The consequences of this vary -- buffered writes will simply re-call
> bmapi_write until it can satisfy at least one block from the original
> request. Direct IO overwrites notice nmaps == 0 and return -ENOSPC
> through the dio mechanism out to userspace with the weird result that
> writes fail even when we have enough space because the ENOSPC return
> overrides any partial write status. For direct CoW writes the situation
> was disastrous because nobody notices us returning an invalid zero-length
> wrong-offset mapping to iomap and the write goes off into space.
>
> Therefore, if free space is so fragmented that we managed to allocate
> some space but not enough to map into even a single block of the
> original allocation request range, we should break the alignment hint in
> order to guarantee at least some forward progress for the direct write.
> If we return a short allocation to iomap_apply it'll call back about the
> remaining blocks.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
>
> The spurious ENOSPC symptoms seem to match what you are seeing here
> on your customer's 4.9 kernel, so it may be that this is the fix for
> the ENOSPC problem that was reported. If this comes up again, then
> perhaps it would be worth either upgrading the kernel to 4.16+ or
> backporting this commit to see if it fixes the problem.

Thanks for remembering. Indeed it looks like a good match for the
problem.
We did not see the problem again (it took quite a combination of
screwups to achieve), but I'll remember this in case we do.

^ permalink raw reply	[flat|nested] 26+ messages in thread
end of thread, other threads:[~2019-02-07 10:51 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-17  7:52 ENSOPC on a 10% used disk Avi Kivity
2018-10-17  8:47 ` Christoph Hellwig
2018-10-17  8:57   ` Avi Kivity
2018-10-17 10:54     ` Avi Kivity
2018-10-18  1:37 ` Dave Chinner
2018-10-18  7:55   ` Avi Kivity
2018-10-18 10:05     ` Dave Chinner
2018-10-18 11:00       ` Avi Kivity
2018-10-18 13:36         ` Avi Kivity
2018-10-19  7:51           ` Dave Chinner
2018-10-21  8:55             ` Avi Kivity
2018-10-21 14:28               ` Dave Chinner
2018-10-22  8:35                 ` Avi Kivity
2018-10-22  9:52                   ` Dave Chinner
2018-10-18 15:44         ` Avi Kivity
2018-10-18 16:11           ` Avi Kivity
2018-10-19  1:24             ` Dave Chinner
2018-10-21  9:00               ` Avi Kivity
2018-10-21 14:34                 ` Dave Chinner
2018-10-19  1:15       ` Dave Chinner
2018-10-21  9:21         ` Avi Kivity
2018-10-21 15:06           ` Dave Chinner
2018-10-18 15:54 ` Eric Sandeen
2018-10-21 11:49   ` Avi Kivity
2019-02-05 21:48 ` Dave Chinner
2019-02-07 10:51   ` Avi Kivity