* Do different btrfs volumes compete for CPU?
@ 2017-03-31 7:05 Marat Khalili
2017-03-31 11:49 ` Duncan
0 siblings, 1 reply; 9+ messages in thread
From: Marat Khalili @ 2017-03-31 7:05 UTC (permalink / raw)
To: linux-btrfs
Approximately 16 hours ago I ran a script that deleted more than ~100
snapshots and started a quota rescan on a large USB-connected btrfs volume
(5.4 of 22 TB occupied now). The quota rescan only completed just now, with
100% CPU load from [btrfs-transacti] throughout this period, which is
probably acceptable depending on your view of things.
What worries me is an innocent process using _another_, SATA-connected
btrfs volume that hung right after I started my script and took more than
30 minutes to be SIGKILLed. There's nothing interesting in the kernel log,
and attempts to attach strace to the process produced no output, but I of
course suspect that it froze on a disk operation.
I wonder:
1) Can there be contention for CPU or some mutexes between kernel
btrfs threads belonging to different volumes?
2) If yes, can anything be done about it other than mounting the
volumes from (different) VMs?
> $ uname -a; btrfs --version
> Linux host 4.4.0-66-generic #87-Ubuntu SMP Fri Mar 3 15:29:05 UTC 2017
> x86_64 x86_64 x86_64 GNU/Linux
> btrfs-progs v4.4
--
With Best Regards,
Marat Khalili
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Do different btrfs volumes compete for CPU?
2017-03-31 7:05 Do different btrfs volumes compete for CPU? Marat Khalili
@ 2017-03-31 11:49 ` Duncan
2017-03-31 12:28 ` Marat Khalili
0 siblings, 1 reply; 9+ messages in thread
From: Duncan @ 2017-03-31 11:49 UTC (permalink / raw)
To: linux-btrfs
Marat Khalili posted on Fri, 31 Mar 2017 10:05:20 +0300 as excerpted:
> Approximately 16 hours ago I ran a script that deleted more than ~100
> snapshots and started a quota rescan on a large USB-connected btrfs volume
> (5.4 of 22 TB occupied now). The quota rescan only completed just now, with
> 100% CPU load from [btrfs-transacti] throughout this period, which is
> probably acceptable depending on your view of things.
>
> What worries me is an innocent process using _another_, SATA-connected
> btrfs volume that hung right after I started my script and took more than
> 30 minutes to be SIGKILLed. There's nothing interesting in the kernel log,
> and attempts to attach strace to the process produced no output, but I of
> course suspect that it froze on a disk operation.
>
> I wonder:
> 1) Can there be contention for CPU or some mutexes between kernel
> btrfs threads belonging to different volumes?
> 2) If yes, can anything be done about it other than mounting the
> volumes from (different) VMs?
>
>
>> $ uname -a; btrfs --version
>> Linux host 4.4.0-66-generic #87-Ubuntu SMP
>> Fri Mar 3 15:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>> btrfs-progs v4.4
It would have been interesting if you had any output from, for instance,
htop during that time, showing the wait percentage on the various cores
and the status (probably D, disk-wait) of the innocent process. iotop
output would of course have been even better, but it's also rather more
special-case and so less commonly installed.
I believe you'll find that the problem isn't btrfs but rather I/O
contention, and that if you try the same thing with one of the
filesystems being, for instance, ext4, you'll see the same problem there
as well. Because the two filesystems would then not be the same type,
that should demonstrate that the problem isn't at the filesystem level
but elsewhere.
USB is infamous for being an I/O bottleneck, slowing things down both for
itself and, on less than perfectly configured systems, often for data
access on other devices as well. SATA can and does do similar things,
but because it tends to be more efficient in general, it doesn't tend to
make things as drastically bad for as long as USB can.
There are some knobs you can tweak for better interactivity, but I need
to be up for work in a couple of hours, so I'll leave it to other posters
to make suggestions in that regard for now.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Do different btrfs volumes compete for CPU?
2017-03-31 11:49 ` Duncan
@ 2017-03-31 12:28 ` Marat Khalili
2017-04-01 2:04 ` Duncan
2017-04-01 10:17 ` Peter Grandi
0 siblings, 2 replies; 9+ messages in thread
From: Marat Khalili @ 2017-03-31 12:28 UTC (permalink / raw)
To: linux-btrfs
Thank you very much for the reply and suggestions; more comments below.
Still, is there a definite answer to the root question: are different
btrfs volumes independent in terms of CPU, or are there some shared
workers that can be a point of contention?
> It would have been interesting if you had any output from, for instance,
> htop during that time, showing the wait percentage on the various cores
> and the status (probably D, disk-wait) of the innocent process. iotop
> output would of course have been even better, but it's also rather more
> special-case and so less commonly installed.
Curiously, I had iotop but not htop running. [btrfs-transacti] showed
some low-level activity in iotop (I still assume it was CPU-limited);
the innocent process showed no activity anywhere. Next time I'll also
note the process state in ps (sadly, my omission).
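For the record, a couple of commands that would capture that state next
time (the PID below is a placeholder, not from this incident):

```shell
# The STAT column shows "D" for uninterruptible (disk) sleep; wchan
# shows the kernel function the task is blocked in. 12345 stands in
# for the stuck process's PID.
ps -o pid,stat,wchan:32,cmd -p 12345

# As root, the kernel stack of the stuck task can also be dumped:
cat /proc/12345/stack
```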
> I believe you will find that the problem isn't btrfs, but rather, I/O
> contention
This possibility hadn't come to my mind. Can USB drivers still be that
bad in 4.4? Is there any way to distinguish between these two situations
(btrfs vs. USB load)?
BTW, USB adapter used is this one (though storage array only supports
USB 3.0): https://www.asus.com/Motherboard-Accessory/USB_31_TYPEA_CARD/
> and that if you try the same thing with one of the
> filesystems being for instance ext4, you'll see the same problem there as
> well
Not sure it's possible to reproduce the problem with ext4, since such
extensive metadata operations can't be performed there, and simply
moving large amounts of data has never caused problems for me regardless
of filesystem.
--
With Best Regards,
Marat Khalili
* Re: Do different btrfs volumes compete for CPU?
2017-03-31 12:28 ` Marat Khalili
@ 2017-04-01 2:04 ` Duncan
2017-04-01 10:17 ` Peter Grandi
1 sibling, 0 replies; 9+ messages in thread
From: Duncan @ 2017-04-01 2:04 UTC (permalink / raw)
To: linux-btrfs
Marat Khalili posted on Fri, 31 Mar 2017 15:28:20 +0300 as excerpted:
>> and that if you try the same thing with one of the filesystems being
>> for instance ext4, you'll see the same problem there as well
> Not sure it's possible to reproduce the problem with ext4, since such
> extensive metadata operations can't be performed there, and simply
> moving large amounts of data has never caused problems for me regardless
> of filesystem.
Try ext4 as the one hosting the innocent process...
And you said moving large amounts of data never triggered problems, but
were you doing that over USB?
As for knobs I mentioned...
I'm not particularly sure about the knobs on USB, but...
For instance, on my old PCI-X (pre-PCIe) server board, the BIOS had a
setting for the PCI transfer size. Given that each transfer has an
effectively fixed overhead and the bus itself has a maximum bandwidth,
there is a tradeoff (actually reasonably common elsewhere as well):
larger transfer sizes give higher throughput due to lower per-transfer
overhead, but at the expense of interactivity, since other processes
must wait for each transfer to complete; smaller transfer sizes give
better interactivity and shorter waits on a full bus, at the expense of
throughput due to higher per-transfer overhead.
I was having trouble with music cutouts and tried various Linux and ALSA
settings to no avail, but once I set the BIOS to a much lower PCI
transfer size, everything functioned much more smoothly: not just the
music, but also the mouse, with less waiting on disk reads (because the
writes were shorter), etc.
I /think/ the USB knobs are all in the kernel, but I believe there are
similar transfer-size knobs there, if you know where to look.
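One generic block-layer knob in this spirit, as a sketch (sdX is a
placeholder for the actual device, and 128 KiB is illustrative, not a
recommendation): the maximum size of a single request the block layer
will issue, which applies to USB disks as well.

```shell
# Show the current maximum per-request transfer size, in KiB:
cat /sys/block/sdX/queue/max_sectors_kb

# Lower it (needs root; not persistent across reboots):
echo 128 > /sys/block/sdX/queue/max_sectors_kb
```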
Beyond that, there are more generic I/O knobs, listed below. If it was
CPU rather than I/O blocking, they might not help in this context, but
they're worth knowing about anyway, particularly the dirty_* settings
mentioned last. (USB is much more CPU-intensive than most transfer
buses, one reason Intel pushed it so hard as opposed to, say, FireWire,
which offloads far more to the bus hardware and thus isn't as
CPU-intensive. So the USB knobs may well be worth investigating even if
it was CPU. I just wish I knew more about them.)
There's also the I/O scheduler. CFQ has long been the default, but you
might try deadline, and there's now multiqueue deadline (aka mq-deadline)
as well. Noop is occasionally recommended for certain SSD use-cases, but
it's not appropriate for spinning rust. Of course most of the schedulers
have detail knobs you can tweak too, but I'm not sufficiently
knowledgeable about those to say much about them.
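For reference, checking and switching the scheduler looks like this
(sdX is a placeholder; the change needs root and is not persistent
across reboots):

```shell
# The active scheduler for a device is shown in brackets:
cat /sys/block/sdX/queue/scheduler

# Switch this one device to deadline:
echo deadline > /sys/block/sdX/queue/scheduler
```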
And 4.10 introduced the block-device writeback throttling global option
(BLK_WBT) along with separate options underneath it for single-queue and
multi-queue writeback throttling. I turned those on here, but as most of
my system's on fast ssd, I didn't notice, nor did I expect to notice,
much difference. However, in theory it could make quite some difference
with USB-based storage, particularly slow thumb-drives and spinning rust.
Last but certainly not least, as it can make quite a difference (and
indeed did make a difference here back when I was on spinning rust),
there's the dirty-data write caching, typically configured via the
distro's sysctl mechanism but also manually configurable via the
/proc/sys/vm/dirty_* files. The writeback-throttling features mentioned
above may eventually reduce the need to tweak these, but until they're
in commonly deployed kernels, tweaking these settings can make QUITE a
big difference, because the percentage-of-RAM defaults were chosen back
when 64 MB of RAM was big, and they simply aren't appropriate for modern
systems with often double-digit GiB of RAM.
I'll skip the details here, as there are plenty of writeups on the web
about tweaking these, as well as kernel text-file documentation, but you
may want to look into this if you haven't, because as I said it can make
a HUGE difference in effective system interactivity.
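As a sketch of the kind of tweak meant here (the byte values are purely
illustrative, not a recommendation), one common approach is switching
from the percentage knobs to absolute byte limits in /etc/sysctl.conf:

```shell
# Setting the *_bytes knobs overrides the percentage-based *_ratio knobs.
# Values below are illustrative only; tune for your workload and devices.
vm.dirty_background_bytes = 67108864    # start background writeback at 64 MiB
vm.dirty_bytes = 268435456              # throttle writers at 256 MiB dirty
```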
That's what I know of. I'd be a lot more comfortable with things if
someone else had confirmed my original post, as I'm not a dev, just a
btrfs user and list regular. But I do know we've not had a lot of
reports of this sort of problem posted, and when we have in the past and
it actually was separate btrfs filesystems, it turned out it was /not/
btrfs, so I'm /reasonably/ sure about it. I also run multiple btrfs
filesystems here and haven't seen the issue, but they're all on the same
pair of partitioned, quite fast SSDs on SATA, so the comparison is
admittedly of highly limited value.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: Do different btrfs volumes compete for CPU?
2017-03-31 12:28 ` Marat Khalili
2017-04-01 2:04 ` Duncan
@ 2017-04-01 10:17 ` Peter Grandi
2017-04-03 8:02 ` Marat Khalili
1 sibling, 1 reply; 9+ messages in thread
From: Peter Grandi @ 2017-04-01 10:17 UTC (permalink / raw)
To: linux-btrfs
>> Approximately 16 hours ago I ran a script that deleted more
>> than ~100 snapshots and started a quota rescan on a large
>> USB-connected btrfs volume (5.4 of 22 TB occupied now).
That "USB-connected" is a rather bad idea. On the IRC channel
#Btrfs, whenever someone reports odd things happening I ask "is
that USB?", and usually it is, and then we say "good luck!" :-).
The issues are:
* The USB mass-storage protocol is poorly designed, in
particular for error handling.
* The underlying USB protocol is very CPU-intensive.
* Most importantly, nearly all USB chipsets, both system-side
and peripheral-side, are breathtakingly buggy, but this does
not get noticed for most USB devices.
>> Quota rescan only completed just now, with 100% load from
>> [btrfs-transacti] throughout this period,
> [ ... ] are different btrfs volumes independent in terms of
> CPU, or are there some shared workers that can be point of
> contention?
As written that question is meaningless: despite the current
mania for "threads"/"threadlets" a filesystem driver is a
library, not a set of processes (all those '[btrfs-*]'
threadlets are somewhat misguided ways to do background
stuff).
The real problems here are:
* Qgroups are famously system CPU intensive, even if less so
than in earlier releases, especially with subvolumes, so the
16 hours CPU is both absurd and expected. I think that qgroups
are still effectively unusable.
* The scheduler gives excessive priority to kernel threads, so
they can crowd out user processes. When for whatever reason
the system CPU percentage rises everything else usually
suffers.
> BTW, USB adapter used is this one (though storage array only
> supports USB 3.0):
> https://www.asus.com/Motherboard-Accessory/USB_31_TYPEA_CARD/
Only Intel/AMD USB chipsets and a few others are fairly
reliable, and for mass storage only with USB3 with UASP, which
is basically SATA-over-USB (more precisely, the SCSI command
set over USB). Your system-side card seems to be recent enough
to do UASP, but probably the peripheral-side chipset isn't.
Things are so bad with third-party chipsets that even several
types of add-on SATA and SAS cards are too buggy.
* Re: Do different btrfs volumes compete for CPU?
2017-04-01 10:17 ` Peter Grandi
@ 2017-04-03 8:02 ` Marat Khalili
2017-04-04 17:36 ` Peter Grandi
0 siblings, 1 reply; 9+ messages in thread
From: Marat Khalili @ 2017-04-03 8:02 UTC (permalink / raw)
To: Peter Grandi, linux-btrfs
On 01/04/17 13:17, Peter Grandi wrote:
> That "USB-connected" is a rather bad idea. On the IRC channel
> #Btrfs, whenever someone reports odd things happening I ask "is
> that USB?", and usually it is, and then we say "good luck!" :-).
You're right, but USB/eSATA arrays are dirt cheap compared with
similar-performance SAN/NAS solutions, which we unfortunately cannot
really afford here.
Just a bit of back-story: I tried to use eSATA and ext4 first, but
observed silent data corruption and irrecoverable kernel hangs --
apparently, SATA is not really designed for external use. That's when I
switched to both USB and, coincidentally, btrfs, and stability became
orders of magnitude better even on a re-purposed consumer-grade PC (Z77
chipset, 3rd-gen i5) with a horribly outdated kernel. Now I'm rebuilding
the same configuration on server-grade hardware (C610 chipset, 40-lane
Xeon) and a modern kernel, and thus would be very surprised to find
problems in USB throughput.
> As written that question is meaningless: despite the current
> mania for "threads"/"threadlets" a filesystem driver is a
> library, not a set of processes (all those '[btrfs-*]'
> threadlets are somewhat misguided ways to do background
> stuff).
But these threadlets, misguided as they are, do exist, don't they?
> * Qgroups are famously system CPU intensive, even if less so
> than in earlier releases, especially with subvolumes, so the
> 16 hours CPU is both absurd and expected. I think that qgroups
> are still effectively unusable.
I understand that qgroups are very much a work in progress, but (correct
me if I'm wrong) right now they're the only way to estimate the real
usage of a subvolume and its snapshots. For instance, if I have a dozen
1TB subvolumes, each having ~50 snapshots, and suddenly run out of space
on a 24TB volume, how do I find the culprit without qgroups? Keeping an
eye on storage use is essential for any real-life use of snapshots, and
they are too convenient as a backup de-duplication tool to give up.
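For reference, the workflow meant here looks roughly like this (command
sketch; /mnt/vol is a placeholder mount point):

```shell
# Enable quotas and wait for the (expensive) initial rescan:
btrfs quota enable /mnt/vol
btrfs quota rescan -w /mnt/vol

# Per-subvolume accounting, with parent qgroups shown: "rfer" is the
# data a subvolume references in total, "excl" is what would be freed
# if it alone were deleted.
btrfs qgroup show -p /mnt/vol
```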
Just a stray thought: btrfs seems to lack an object type in between
volume and subvolume, one that would keep track of storage use by
several subvolumes plus their snapshots, allow snapshotting/transferring
multiple subvolumes at once, etc. Some kind of super-subvolume
(supervolume?) that is hierarchical. With the increasing use of
subvolumes/snapshots within a single system installation, and multiple
system installations (belonging to different users) in one volume due to
liberal use of LXC and similar technologies, this will become more and
more of a pressing problem.
> * The scheduler gives excessive priority to kernel threads, so
> they can crowd out user processes. When for whatever reason
> the system CPU percentage rises everything else usually
> suffers.
I thought it was clear, but it probably needs spelling out: while one
core was completely occupied by the [btrfs-transacti] thread, five more
were mostly idle, serving occasional network requests without any
problems. And only the process that used storage intensively died.
Fortunately or not, it's the only data point so far -- smaller snapshot
cullings do not cause problems.
> Only Intel/AMD USB chipsets and a few others are fairly
> reliable, and for mass storage only with USB3 with UASP, which
> is basically SATA-over-USB (more precisely, the SCSI command
> set over USB). Your system-side card seems to be recent enough
> to do UASP, but probably the peripheral-side chipset isn't.
> Things are so bad with third-party chipsets that even several
> types of add-on SATA and SAS cards are too buggy.
Thank you very much for this hint. The card is indeed the unknown factor
here and I'll keep a close eye on it. The chip is an ASM1142; not
Intel/AMD sadly, but quite popular nevertheless.
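One quick diagnostic worth running here (just a suggestion): check
whether the kernel actually bound the enclosure to the uas driver
(UASP) or fell back to plain usb-storage.

```shell
# lsusb -t shows the driver bound to each USB device; a UASP-capable
# bridge appears with "Driver=uas", while the slower BOT fallback
# shows "Driver=usb-storage".
lsusb -t
```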
--
With Best Regards,
Marat Khalili
* Re: Do different btrfs volumes compete for CPU?
2017-04-03 8:02 ` Marat Khalili
@ 2017-04-04 17:36 ` Peter Grandi
2017-04-05 7:04 ` Marat Khalili
0 siblings, 1 reply; 9+ messages in thread
From: Peter Grandi @ 2017-04-04 17:36 UTC (permalink / raw)
To: Linux fs Btrfs
> [ ... ] I tried to use eSATA and ext4 first, but observed
> silent data corruption and irrecoverable kernel hangs --
> apparently, SATA is not really designed for external use.
SATA works for external use, eSATA works well, but what really
matters is the chipset of the adapter card.
In my experience JMicron is not so good, Marvell a bit better,
best is to use a recent motherboard chipset with a SATA-eSATA
internal cable and bracket.
>> As written that question is meaningless: despite the current
>> mania for "threads"/"threadlets" a filesystem driver is a
>> library, not a set of processes (all those '[btrfs-*]'
>> threadlets are somewhat misguided ways to do background
>> stuff).
> But these threadlets, misguided as the are, do exist, don't
> they?
But that does not change the fact that it is a library and work
is initiated by user requests which are not per-subvolume, but
in effect per-volume.
> I understand that qgroups are very much a work in progress, but
> (correct me if I'm wrong) right now they're the only way to
> estimate the real usage of a subvolume and its snapshots.
It is a way to do so and not a very good way. There is no
obviously good way to define "real usage" in the presence of
hard-links and reflinking, and qgroups use just one way to
define it. A similar problem happens with processes in the
presence of shared pages, multiple mapped shared libraries etc.
> For instance, if I have a dozen 1TB subvolumes, each having ~50
> snapshots, and suddenly run out of space on a 24TB volume, how
> do I find the culprit without qgroups?
It is not clear what "culprit" means here. The problem is that both
hard-links and reflinking create really significant ambiguities as to
used space. Plus, the same problem would happen with directories instead
of subvolumes and hard-links instead of reflinked snapshots.
> [ ... ] The chip is ASM1142, not Intel/AMD sadly but quite
> popular nevertheless.
ASMedia USB3 chipsets are fairly reliable, at least the card
ones on the system side. The ones on the disk side I don't know
much about, though I have seen some ASMedia ones that also seem
OK. For the disks I use Seagate and WDC external boxes from
which I have removed the original disks, as I have noticed that
Seagate and WDC, for obvious reasons, tend to test and use the
more reliable chipsets. I have also got an external USB3 dock
with a recent ASMedia chipset that also seems good, but I
haven't used it much.
* Re: Do different btrfs volumes compete for CPU?
2017-04-04 17:36 ` Peter Grandi
@ 2017-04-05 7:04 ` Marat Khalili
2017-04-07 0:17 ` Martin
0 siblings, 1 reply; 9+ messages in thread
From: Marat Khalili @ 2017-04-05 7:04 UTC (permalink / raw)
To: Linux fs Btrfs
On 04/04/17 20:36, Peter Grandi wrote:
> SATA works for external use, eSATA works well, but what really
> matters is the chipset of the adapter card.
eSATA might be sound electrically, but mechanically it is awful. Try
running it for months in a crowded server room, and inevitably you'll
get disconnections and data corruption. I tried different cables and
brackets -- same result. If you've ever handled an eSATA connector,
you'll know the feeling.
> In my experience JMicron is not so good, Marvell a bit better,
> best is to use a recent motherboard chipset with a SATA-eSATA
> internal cable and bracket.
That's exactly what I used to use: internal controller of Z77 chipset +
bracket(s).
> But that does not change the fact that it is a library and work
> is initiated by user requests which are not per-subvolume, but
> in effect per-volume.
That's the answer I was looking for.
> It is a way to do so and not a very good way. There is no
> obviously good way to define "real usage" in the presence of
> hard-links and reflinking, and qgroups use just one way to
> define it. A similar problem happens with processes in the
> presence of shared pages, multiple mapped shared libraries etc.
No need to over-generalize. There's an obviously good way to define
"real usage" of a subvolume and its snapshots as long as it doesn't
share any data with other subvolumes, as is often the case. If it does
share, two figures -- exclusive and referenced, like in qgroups -- are
sufficient for most tasks.
> The problem is that
> both hard-links and ref-linking create really significant
> ambiguities as to used space. Plus the same problem would happen
> with directories instead of subvolumes and hard-links instead of
> reflinked snapshots.
You're right, although with hard-links there's at least a remote chance
of estimating storage use with user-mode scripts.
> ASMedia USB3 chipsets are fairly reliable, at least the card
> ones on the system side. The ones on the disk side I don't know
> much about.
This is getting increasingly off-topic, but our mainstay is CFI 5-disk
DAS boxes (8253JDGG to be exact) filled with WD Reds in a RAID5
configuration. They are no longer produced and are getting harder and
harder to source, but they have shown themselves to be very reliable.
According to lsusb they contain a JMicron JMS567 SATA 6Gb/s bridge.
--
With Best Regards,
Marat Khalili
* Re: Do different btrfs volumes compete for CPU?
2017-04-05 7:04 ` Marat Khalili
@ 2017-04-07 0:17 ` Martin
0 siblings, 0 replies; 9+ messages in thread
From: Martin @ 2017-04-07 0:17 UTC (permalink / raw)
To: linux-btrfs
On 05/04/17 08:04, Marat Khalili wrote:
> On 04/04/17 20:36, Peter Grandi wrote:
>> SATA works for external use, eSATA works well, but what really
>> matters is the chipset of the adapter card.
> eSATA might be sound electrically, but mechanically it is awful. Try
> running it for months in a crowded server room, and inevitably you'll
> get disconnections and data corruption. I tried different cables and
> brackets -- same result. If you've ever handled an eSATA connector,
> you'll know the feeling.
Been using eSATA here for multiple disk packs, continuously connected
for a few years now, holding 48TB of data (not enough room in the host
for the disks).
Never suffered an eSATA disconnect.
Had the usual cooling-fan failures and HDD failures due to old age.
All just a case of ensuring undisturbed clean cabling and a good UPS?...
(BTRFS spanning four disks per external pack has worked well also.)
Good luck,
Martin