* Do different btrfs volumes compete for CPU?
@ 2017-03-31 7:05 Marat Khalili
From: Marat Khalili @ 2017-03-31 7:05 UTC (permalink / raw)
To: linux-btrfs
Approximately 16 hours ago I ran a script that deleted more than ~100
snapshots and started a quota rescan on a large USB-connected btrfs
volume (5.4 TB of 22 TB occupied now). The quota rescan only completed
just now, with 100% load from [btrfs-transacti] throughout this period,
which is probably ~ok depending on your view of things.
What worries me is an innocent process using _another_, SATA-connected
btrfs volume, which hung right after I started my script and took more
than 30 minutes to be sigkilled. There's nothing interesting in the
kernel log, and attempts to attach strace to the process output
nothing, but I of course suspect that it froze on a disk operation.
I wonder:
1) Can there be contention for CPU or some mutexes between kernel
btrfs threads belonging to different volumes?
2) If yes, can anything be done about it, other than mounting the
volumes from (different) VMs?
> $ uname -a; btrfs --version
> Linux host 4.4.0-66-generic #87-Ubuntu SMP Fri Mar 3 15:29:05 UTC 2017
> x86_64 x86_64 x86_64 GNU/Linux
> btrfs-progs v4.4
--
With Best Regards,
Marat Khalili
* Re: Do different btrfs volumes compete for CPU?
From: Duncan @ 2017-03-31 11:49 UTC (permalink / raw)
To: linux-btrfs

Marat Khalili posted on Fri, 31 Mar 2017 10:05:20 +0300 as excerpted:

> Approximately 16 hours ago I ran a script that deleted more than ~100
> snapshots and started a quota rescan on a large USB-connected btrfs
> volume (5.4 TB of 22 TB occupied now). The quota rescan only completed
> just now, with 100% load from [btrfs-transacti] throughout this
> period, which is probably ~ok depending on your view of things.
>
> What worries me is an innocent process using _another_, SATA-connected
> btrfs volume, which hung right after I started my script and took more
> than 30 minutes to be sigkilled. There's nothing interesting in the
> kernel log, and attempts to attach strace to the process output
> nothing, but I of course suspect that it froze on a disk operation.
>
> I wonder:
> 1) Can there be contention for CPU or some mutexes between kernel
> btrfs threads belonging to different volumes?
> 2) If yes, can anything be done about it, other than mounting the
> volumes from (different) VMs?
>
>> $ uname -a; btrfs --version
>> Linux host 4.4.0-66-generic #87-Ubuntu SMP Fri Mar 3 15:29:05 UTC
>> 2017 x86_64 x86_64 x86_64 GNU/Linux
>> btrfs-progs v4.4

What would have been interesting would have been if you had any reports
from, for instance, htop during that time, showing the wait percentage
on the various cores and the status (probably D, disk-wait) of the
innocent process. iotop output would of course have been even better,
but it is also rather more special-case, so less commonly installed.
I believe you will find that the problem isn't btrfs but rather I/O
contention, and that if you try the same thing with one of the
filesystems being, for instance, ext4, you'll see the same problem
there as well. Because the two filesystems are then not of the same
type, that should demonstrate that the problem isn't at the filesystem
level but elsewhere.

USB is infamous for being an I/O bottleneck, slowing things down both
for itself and, on less than perfectly configured systems, often for
data access on other devices as well. SATA can and does do similar
things too, but because it tends to be more efficient in general, it
doesn't tend to make things as drastically bad for as long as USB can.

There are some knobs you can twist for better interactivity, but I need
to be up for work in a couple of hours, so I will leave it to other
posters to make suggestions in that regard at this point.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
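[Editorial note: the state and wait numbers asked about above can be
captured without htop. A minimal sketch; the PID used below is just the
running shell, purely as a placeholder for the stuck process:]

```shell
#!/bin/sh
# Sketch: snapshot a suspect process's state plus system-wide iowait.
# State "D" (uninterruptible sleep) is the disk-wait status mentioned
# above; a climbing iowait counter points at I/O rather than CPU.
pid=$$                                         # substitute the stuck PID
state=$(awk '{print $3}' "/proc/$pid/stat")    # field 3 = process state
iowait=$(awk '/^cpu /{print $6}' /proc/stat)   # cumulative ticks, all CPUs
echo "pid=$pid state=$state iowait_ticks=$iowait"
```

Sampling the iowait counter twice and comparing gives the per-interval
figure that htop would have shown.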
* Re: Do different btrfs volumes compete for CPU?
From: Marat Khalili @ 2017-03-31 12:28 UTC (permalink / raw)
To: linux-btrfs

Thank you very much for the reply and suggestions; more comments below.
Still, is there a definite answer to the root question: are different
btrfs volumes independent in terms of CPU, or are there some shared
workers that can be a point of contention?

> What would have been interesting would have been if you had any
> reports from, for instance, htop during that time, showing the wait
> percentage on the various cores and the status (probably D,
> disk-wait) of the innocent process. iotop output would of course have
> been even better, but it is also rather more special-case, so less
> commonly installed.

Curiously, I had iotop but not htop running. [btrfs-transacti] showed
some low-level activity in iotop (I still assume it was CPU-limited);
the innocent process did not show any activity anywhere. Next time I'll
also take note of the process state in ps (sadly, my omission).

> I believe you will find that the problem isn't btrfs but rather I/O
> contention

This possibility did not come to my mind. Can USB drivers still be that
bad in 4.4? Is there any way to discriminate between these two
situations (btrfs vs. USB load)? BTW, the USB adapter used is this one
(though the storage array only supports USB 3.0):
https://www.asus.com/Motherboard-Accessory/USB_31_TYPEA_CARD/

> and that if you try the same thing with one of the filesystems being,
> for instance, ext4, you'll see the same problem there as well

Not sure it's possible to reproduce the problem with ext4, since it's
not possible to perform such extensive metadata operations there, and
simply moving large amounts of data has never created any problems for
me regardless of filesystem.
--
With Best Regards,
Marat Khalili
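[Editorial note: one way to discriminate the two situations, as a
hedged sketch: sample the suspect thread's CPU counters over an
interval. If utime+stime keeps climbing, the thread is CPU-bound
(e.g. qgroup accounting); if the counters stay flat while the task sits
in state D, it is waiting on the device. The PID below is just the
running shell as a stand-in, and the simple field count assumes the
thread name contains no spaces:]

```shell
#!/bin/sh
# Sketch: is the thread burning CPU or blocked on I/O?
# Fields 14 and 15 of /proc/<pid>/stat are utime and stime in clock ticks.
pid=$$                          # substitute the [btrfs-transacti] PID
t0=$(awk '{print $14 + $15}' "/proc/$pid/stat")
sleep 1
t1=$(awk '{print $14 + $15}' "/proc/$pid/stat")
echo "CPU ticks consumed in 1s: $((t1 - t0))"
```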
* Re: Do different btrfs volumes compete for CPU?
From: Duncan @ 2017-04-01 2:04 UTC (permalink / raw)
To: linux-btrfs

Marat Khalili posted on Fri, 31 Mar 2017 15:28:20 +0300 as excerpted:

>> and that if you try the same thing with one of the filesystems being,
>> for instance, ext4, you'll see the same problem there as well

> Not sure it's possible to reproduce the problem with ext4, since it's
> not possible to perform such extensive metadata operations there, and
> simply moving large amounts of data has never created any problems
> for me regardless of filesystem.

Try ext4 as the filesystem hosting the innocent process... And you said
moving large amounts of data never triggered problems, but were you
doing that over USB?

As for the knobs I mentioned... I'm not particularly sure about the
knobs on USB, but for instance, on my old PCI-X (pre-PCIE) server
board, the BIOS had a setting for the size of PCI transfers. Given that
each transfer has an effectively fixed overhead and the bus itself has
a maximum bandwidth, there was the (actually reasonably common
elsewhere as well) tradeoff: higher throughput at larger transfer sizes
(due to lower per-transfer overhead), at the expense of interactivity
while other processes wait for a transfer to complete, versus better
interactivity and shorter waits on a full bus at smaller transfer
sizes, at the expense of throughput due to higher per-transfer
overhead.

I was having trouble with music cutouts and tried various Linux and
ALSA settings to no avail, but once I set the BIOS to a much lower PCI
transfer size, everything functioned much more smoothly: not just the
music, but the mouse, less waiting on disk reads (because the writes
were shorter), etc. I /think/ the USB knobs are all in the kernel, but
believe there are similar transfer-size knobs there, if you know where
to look.
Beyond that, there are the more generic I/O knobs listed below. If it
was CPU rather than I/O blocking, they might not help in this context,
but they are worth knowing about anyway, particularly the dirty_*
settings mentioned last. (USB is much more CPU-intensive than most
transfer buses, one reason Intel pushed it so hard as opposed to, say,
firewire, which offloads far more to the bus hardware and thus isn't as
CPU-intensive. So the USB knobs may well be worth investigating even if
it was CPU. I just wish I knew more about them.)

There's also the I/O scheduler. CFQ has long been the default, but you
might try deadline, and there's now multiqueue deadline (aka MQ
deadline) as well. NoOp is occasionally recommended for certain SSD
use-cases, but it's not appropriate for spinning rust. Of course most
of the schedulers have detail knobs you can twist too, but I'm not
sufficiently knowledgeable about those to say much about them.

And 4.10 introduced the block-device writeback-throttling global option
(BLK_WBT), along with separate options underneath it for single-queue
and multi-queue writeback throttling. I turned those on here, but as
most of my system is on fast SSD, I didn't notice, nor did I expect to
notice, much difference. However, in theory it could make quite some
difference with USB-based storage, particularly slow thumb drives and
spinning rust.

Last but certainly not least, as it can make quite a difference, and
indeed did make a difference here back when I was on spinning rust,
there's the dirty-data write caching, typically configured via the
distro's sysctl mechanism, but which can be manually configured via the
/proc/sys/vm/dirty_* files.
The writeback-throttling features mentioned above may eventually reduce
the need to tweak these, but until they're in commonly deployed
kernels, tweaking these settings can make QUITE a big difference,
because the percentage-of-RAM defaults were chosen back in the day when
64 MB of RAM was big, and they simply aren't appropriate to modern
systems with often double-digit GiB of RAM. I'll skip the details here
as there are plenty of writeups on the web about tweaking these, as
well as kernel text-file documentation, but you may want to look into
this if you haven't, because as I said it can make a HUGE difference in
effective system interactivity.

That's what I know of. I'd be a lot more comfortable with things if
someone else had confirmed my original post, as I'm not a dev, just a
btrfs user and list regular, but I do know we've not had a lot of
reports of this sort of problem posted, and when we have in the past
and it actually involved separate btrfs filesystems, it turned out it
was /not/ btrfs, so I'm /reasonably/ sure about it. I also run multiple
btrfs filesystems here and haven't seen the issue, but they're all on
the same pair of partitioned, quite fast SSDs on SATA, so the
comparison is admittedly of highly limited value.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
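[Editorial note: the scheduler and dirty_* knobs described above can be
inspected from sysfs and procfs. A sketch; the device names come from
whatever the system has, and the byte values in the comments are
illustrative, not recommendations:]

```shell
#!/bin/sh
# Show the active I/O scheduler per block device; the bracketed entry
# is the one in use.
for f in /sys/block/*/queue/scheduler; do
    [ -e "$f" ] || continue
    printf '%s: %s\n' "${f%/queue/scheduler}" "$(cat "$f")"
done

# Extract just the active scheduler name from a line like
# "noop deadline [cfq]" (stand-in string used here):
line='noop deadline [cfq]'
active=$(printf '%s\n' "$line" | sed 's/.*\[\(.*\)\].*/\1/')
echo "active scheduler: $active"

# Current dirty-writeback thresholds (percent of RAM by default):
grep -H . /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
# As root, one could switch to absolute limits better suited to slow
# USB disks, e.g. (illustrative values only):
#   sysctl vm.dirty_background_bytes=$((64 * 1024 * 1024))
#   sysctl vm.dirty_bytes=$((256 * 1024 * 1024))
```

Setting the *_bytes knobs automatically disables the corresponding
*_ratio ones.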
* Re: Do different btrfs volumes compete for CPU?
From: Peter Grandi @ 2017-04-01 10:17 UTC (permalink / raw)
To: linux-btrfs

>> Approximately 16 hours ago I ran a script that deleted more than
>> ~100 snapshots and started a quota rescan on a large USB-connected
>> btrfs volume (5.4 TB of 22 TB occupied now).

That "USB-connected" is a rather bad idea. On the IRC channel #Btrfs,
whenever someone reports odd things happening I ask "is that USB?", and
usually it is, and then we say "good luck!" :-). The issues are:

* The USB mass-storage protocol is poorly designed, in particular for
  error handling.
* The underlying USB protocol is very CPU-intensive.
* Most importantly, nearly all USB chipsets, both system-side and
  peripheral-side, are breathtakingly buggy, but this does not get
  noticed for most USB devices.

>> The quota rescan only completed just now, with 100% load from
>> [btrfs-transacti] throughout this period,

> [ ... ] are different btrfs volumes independent in terms of CPU, or
> are there some shared workers that can be a point of contention?

As written, that question is meaningless: despite the current mania for
"threads"/"threadlets", a filesystem driver is a library, not a set of
processes (all those '[btrfs-*]' threadlets are somewhat misguided ways
to do background work). The real problems here are:

* Qgroups are famously system-CPU intensive, even if less so than in
  earlier releases, especially with many subvolumes, so the 16 hours of
  CPU time is both absurd and expected. I think that qgroups are still
  effectively unusable.
* The scheduler gives excessive priority to kernel threads, so they can
  crowd out user processes. When for whatever reason the system CPU
  percentage rises, everything else usually suffers.
> BTW, the USB adapter used is this one (though the storage array only
> supports USB 3.0):
> https://www.asus.com/Motherboard-Accessory/USB_31_TYPEA_CARD/

Only Intel/AMD USB chipsets and a few others are fairly reliable, and
for mass storage only with USB3 with UASP, which is basically
SATA-over-USB (more precisely, the SCSI command set over USB). Your
system-side card seems to be recent enough to do UASP, but probably the
peripheral-side chipset isn't. Things are so bad with third-party
chipsets that even several types of add-on SATA and SAS cards are too
buggy.
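[Editorial note: whether a given enclosure actually negotiated UASP can
be read from `lsusb -t`, which prints the driver bound to each
interface: `uas` means the fast path, `usb-storage` means the old
bulk-only protocol. A sketch; the sample line below stands in for live
output:]

```shell
#!/bin/sh
# Classify an `lsusb -t` line by its bound driver.
check_uas() {
    case "$1" in
        *Driver=uas,*)         echo "UAS (SCSI over USB, fast path)" ;;
        *Driver=usb-storage,*) echo "bulk-only usb-storage (no UASP)" ;;
        *)                     echo "not a mass-storage interface" ;;
    esac
}
# On a live system:  lsusb -t | while read -r l; do check_uas "$l"; done
check_uas '|__ Port 2: Dev 3, If 0, Class=Mass Storage, Driver=uas, 5000M'
```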
* Re: Do different btrfs volumes compete for CPU?
From: Marat Khalili @ 2017-04-03 8:02 UTC (permalink / raw)
To: Peter Grandi, linux-btrfs

On 01/04/17 13:17, Peter Grandi wrote:
> That "USB-connected" is a rather bad idea. On the IRC channel #Btrfs,
> whenever someone reports odd things happening I ask "is that USB?",
> and usually it is, and then we say "good luck!" :-).

You're right, but USB/eSATA arrays are dirt cheap in comparison with
similar-performance SAN/NAS and the like, which we unfortunately cannot
really afford here.

Just a bit of back-story: I tried to use eSATA and ext4 first, but
observed silent data corruption and irrecoverable kernel hangs;
apparently, SATA is not really designed for external use. That's when I
switched to both USB and, coincidentally, btrfs, and stability became
orders of magnitude better, even on a re-purposed consumer-grade PC
(Z77 chipset, 3rd-gen i5) with a horribly outdated kernel. Now I'm
rebuilding the same configuration on server-grade hardware (C610
chipset, 40-I/O-channel Xeon) and a modern kernel, and thus would be
very surprised to find problems in USB throughput.

> As written, that question is meaningless: despite the current mania
> for "threads"/"threadlets", a filesystem driver is a library, not a
> set of processes (all those '[btrfs-*]' threadlets are somewhat
> misguided ways to do background work).

But these threadlets, misguided as they are, do exist, don't they?

> * Qgroups are famously system-CPU intensive, even if less so than in
> earlier releases, especially with many subvolumes, so the 16 hours of
> CPU time is both absurd and expected. I think that qgroups are still
> effectively unusable.

I understand that qgroups are very much a work in progress, but
(correct me if I'm wrong) right now they're the only way to estimate
the real usage of a subvolume and its snapshots.
For instance, if I have a dozen 1 TB subvolumes, each having ~50
snapshots, and suddenly run out of space on a 24 TB volume, how do I
find the culprit without qgroups? Keeping an eye on storage use is
essential for any real-life use of snapshots, and they are too
convenient as a backup de-duplication tool to give up.

Just a stray thought: btrfs seems to lack an object type in between
volume and subvolume that would keep track of storage use by several
subvolumes plus their snapshots, allow snapshotting/transferring
multiple subvolumes at once, etc. Some kind of super-subvolume
(supervolume?) that is hierarchical. With increasing use of
subvolumes/snapshots within a single system installation, and with
multiple system installations (belonging to different users) in one
volume due to liberal use of LXC and similar technologies, this will
become more and more of a pressing problem.

> * The scheduler gives excessive priority to kernel threads, so they
> can crowd out user processes. When for whatever reason the system CPU
> percentage rises, everything else usually suffers.

I thought it was clear, but it probably needs spelling out: while one
core was completely occupied with the [btrfs-transacti] thread, five
more were mostly idle, serving occasional network requests without any
problems. And only a process that used storage intensively died.
Fortunately or not, it's the only data point so far; smaller snapshot
cullings do not cause problems.

> Only Intel/AMD USB chipsets and a few others are fairly reliable, and
> for mass storage only with USB3 with UASP, which is basically
> SATA-over-USB (more precisely, the SCSI command set over USB). Your
> system-side card seems to be recent enough to do UASP, but probably
> the peripheral-side chipset isn't. Things are so bad with third-party
> chipsets that even several types of add-on SATA and SAS cards are too
> buggy.

Thank you very much for this hint. The card is indeed an unknown factor
here and I'll keep a close eye on it.
The chip is an ASM1142, not Intel/AMD sadly, but quite popular
nevertheless.

--
With Best Regards,
Marat Khalili
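[Editorial note: with quotas enabled, the culprit hunt described above
reduces to sorting `btrfs qgroup show --raw <mount>` output by the
exclusive column. A sketch, with fabricated sample numbers standing in
for live output; the assumed column order is qgroupid, referenced
bytes, exclusive bytes:]

```shell
#!/bin/sh
# Pick the qgroup holding the most exclusive (unshared) data.
find_culprit() {
    sort -k3,3 -rn | head -n 1 | awk '{print $1 " exclusive_bytes=" $3}'
}
# Sample rows shaped like `btrfs qgroup show --raw` data lines:
printf '0/257 1099511627776 9663676416\n0/258 1099511627776 322122547200\n' \
    | find_culprit
# -> 0/258 exclusive_bytes=322122547200
```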
* Re: Do different btrfs volumes compete for CPU?
From: Peter Grandi @ 2017-04-04 17:36 UTC (permalink / raw)
To: Linux fs Btrfs

> [ ... ] I tried to use eSATA and ext4 first, but observed silent data
> corruption and irrecoverable kernel hangs -- apparently, SATA is not
> really designed for external use.

SATA works for external use, and eSATA works well, but what really
matters is the chipset of the adapter card. In my experience JMicron is
not so good, Marvell a bit better; best is to use a recent motherboard
chipset with an internal SATA-to-eSATA cable and bracket.

>> As written, that question is meaningless: despite the current mania
>> for "threads"/"threadlets", a filesystem driver is a library, not a
>> set of processes (all those '[btrfs-*]' threadlets are somewhat
>> misguided ways to do background work).

> But these threadlets, misguided as they are, do exist, don't they?

They do, but that does not change the fact that it is a library, and
work is initiated by user requests, which are not per-subvolume but in
effect per-volume.

> I understand that qgroups are very much a work in progress, but
> (correct me if I'm wrong) right now they're the only way to estimate
> the real usage of a subvolume and its snapshots.

It is a way to do so, and not a very good one. There is no obviously
good way to define "real usage" in the presence of hard links and
reflinking, and qgroups use just one way to define it. A similar
problem happens with processes in the presence of shared pages,
multiple mapped shared libraries, etc.

> For instance, if I have a dozen 1 TB subvolumes, each having ~50
> snapshots, and suddenly run out of space on a 24 TB volume, how do I
> find the culprit without qgroups?

It is not clear what "culprit" means here. The problem is that both
hard links and reflinking create really significant ambiguities as to
used space.
Plus the same problem would happen with directories instead of
subvolumes and hard links instead of reflinked snapshots.

> [ ... ] The chip is an ASM1142, not Intel/AMD sadly, but quite
> popular nevertheless.

ASMedia USB3 chipsets are fairly reliable, at least the card ones on
the system side. The ones on the disk side I don't know much about; I
have seen some ASMedia ones that also seem OK. For the disks I use a
Seagate and a WDC external box from which I have removed the original
disk, as I have noticed that Seagate and WDC, for obvious reasons, tend
to test and use the more reliable chipsets. I have also got an external
USB3 dock with a recent ASMedia chipset that also seems good, but I
haven't used it much.
* Re: Do different btrfs volumes compete for CPU?
From: Marat Khalili @ 2017-04-05 7:04 UTC (permalink / raw)
To: Linux fs Btrfs

On 04/04/17 20:36, Peter Grandi wrote:
> SATA works for external use, and eSATA works well, but what really
> matters is the chipset of the adapter card.

eSATA might be sound electrically, but mechanically it is awful. Try to
run it for months in a crowded server room, and inevitably you'll get
disconnections and data corruption. I tried different cables and
brackets; same result. If you've ever used an eSATA connector, you'd
feel it.

> In my experience JMicron is not so good, Marvell a bit better; best
> is to use a recent motherboard chipset with an internal SATA-to-eSATA
> cable and bracket.

That's exactly what I used to use: the internal controller of the Z77
chipset plus bracket(s).

> They do, but that does not change the fact that it is a library, and
> work is initiated by user requests, which are not per-subvolume but
> in effect per-volume.

That's the answer I was looking for.

> It is a way to do so, and not a very good one. There is no obviously
> good way to define "real usage" in the presence of hard links and
> reflinking, and qgroups use just one way to define it. A similar
> problem happens with processes in the presence of shared pages,
> multiple mapped shared libraries, etc.

No need to over-generalize. There's an obviously good way to define the
"real usage" of a subvolume and its snapshots as long as it doesn't
share any data with other subvolumes, as is often the case. If it does
share, two figures -- exclusive and referenced, as in qgroups -- are
sufficient for most tasks.

> The problem is that both hard links and reflinking create really
> significant ambiguities as to used space. Plus the same problem would
> happen with directories instead of subvolumes and hard links instead
> of reflinked snapshots.
You're right, although with hard links there's at least a remote chance
of estimating storage use with user-mode scripts.

> ASMedia USB3 chipsets are fairly reliable, at least the card ones on
> the system side. The ones on the disk side I don't know much about.

This is getting increasingly off-topic, but our mainstay is CFI 5-disk
DAS boxes (8253JDGG to be exact) filled with WD Reds in a RAID5
configuration. They are no longer produced and are getting harder and
harder to source, but they have shown themselves to be very reliable.
According to lsusb they contain a JMicron JMS567 SATA 6Gb/s bridge.

--
With Best Regards,
Marat Khalili
* Re: Do different btrfs volumes compete for CPU?
From: Martin @ 2017-04-07 0:17 UTC (permalink / raw)
To: linux-btrfs

On 05/04/17 08:04, Marat Khalili wrote:
> On 04/04/17 20:36, Peter Grandi wrote:
>> SATA works for external use, and eSATA works well, but what really
>> matters is the chipset of the adapter card.

> eSATA might be sound electrically, but mechanically it is awful. Try
> to run it for months in a crowded server room, and inevitably you'll
> get disconnections and data corruption. I tried different cables and
> brackets; same result. If you've ever used an eSATA connector, you'd
> feel it.

I've been using eSATA here for multiple disk packs, continuously
connected for a few years now, for 48 TB of data (not enough room in
the host for the disks). I have never suffered an eSATA disconnect.
I've had the usual cooling-fan failures and HDD failures due to old
age. All just a case of ensuring undisturbed clean cabling and a good
UPS?...

(BTRFS spanning four disks per external pack has worked well also.)

Good luck,

Martin