* Do different btrfs volumes compete for CPU?
@ 2017-03-31  7:05 Marat Khalili
  2017-03-31 11:49 ` Duncan
  0 siblings, 1 reply; 9+ messages in thread
From: Marat Khalili @ 2017-03-31  7:05 UTC (permalink / raw)
  To: linux-btrfs

Approximately 16 hours ago I ran a script that deleted more than ~100 
snapshots and started a quota rescan on a large USB-connected btrfs volume 
(5.4 of 22 TB occupied now). The quota rescan only completed just now, with 
100% load from [btrfs-transacti] throughout this period, which is 
probably ~OK depending on your view of things.

What worries me is an innocent process using _another_, SATA-connected 
btrfs volume that hung right after I started my script and took >30 
minutes to be SIGKILLed. There's nothing interesting in the kernel log, 
and attempts to attach strace to the process produced no output, but I of 
course suspect that it froze on a disk operation.

I wonder:
1) Can there be contention for CPU or some mutexes between kernel 
btrfs threads belonging to different volumes?
2) If yes, can anything be done about it other than mounting the volumes 
from (different) VMs?


> $ uname -a; btrfs --version
> Linux host 4.4.0-66-generic #87-Ubuntu SMP Fri Mar 3 15:29:05 UTC 2017 
> x86_64 x86_64 x86_64 GNU/Linux
> btrfs-progs v4.4

--

With Best Regards,
Marat Khalili


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Do different btrfs volumes compete for CPU?
  2017-03-31  7:05 Do different btrfs volumes compete for CPU? Marat Khalili
@ 2017-03-31 11:49 ` Duncan
  2017-03-31 12:28   ` Marat Khalili
  0 siblings, 1 reply; 9+ messages in thread
From: Duncan @ 2017-03-31 11:49 UTC (permalink / raw)
  To: linux-btrfs

Marat Khalili posted on Fri, 31 Mar 2017 10:05:20 +0300 as excerpted:

> Approximately 16 hours ago I ran a script that deleted more than ~100
> snapshots and started a quota rescan on a large USB-connected btrfs volume
> (5.4 of 22 TB occupied now). The quota rescan only completed just now, with
> 100% load from [btrfs-transacti] throughout this period, which is
> probably ~OK depending on your view of things.
> 
> What worries me is an innocent process using _another_, SATA-connected
> btrfs volume that hung right after I started my script and took >30
> minutes to be SIGKILLed. There's nothing interesting in the kernel log,
> and attempts to attach strace to the process produced no output, but I of
> course suspect that it froze on a disk operation.
> 
> I wonder:
> 1) Can there be contention for CPU or some mutexes between kernel
> btrfs threads belonging to different volumes?
> 2) If yes, can anything be done about it other than mounting the volumes
> from (different) VMs?
> 
> 
>> $ uname -a; btrfs --version
>> Linux host 4.4.0-66-generic #87-Ubuntu SMP
>> Fri Mar 3 15:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>> btrfs-progs v4.4

It would have been interesting if you had any reports from, for 
instance, htop during that time, showing the wait percentage on the 
various cores and the status (probably D, disk-wait) of the innocent 
process.  iotop output would of course have been even better, but it's 
also rather more special-case, so less commonly installed.
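
For next time, the process state is easy to capture from a shell (a 
minimal sketch; the PID here is the shell's own, standing in for the 
PID of the stuck process):

```shell
# Show the state of a process; substitute the PID of the stuck process.
# STAT "D" means uninterruptible sleep, almost always a wait on disk I/O.
pid=$$   # stand-in: the current shell's own PID
ps -o pid,stat,comm -p "$pid"
```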

I believe you will find that the problem isn't btrfs, but rather I/O 
contention, and that if you try the same thing with one of the 
filesystems being for instance ext4, you'll see the same problem there 
as well.  Because the two filesystems would then not be the same type, 
that should demonstrate the problem isn't at the filesystem level, but 
elsewhere.

USB is infamous for being an I/O bottleneck, slowing things down both 
for itself and, on less than perfectly configured systems, often for 
data access on other devices as well.  SATA can and does do similar 
things, but because it tends to be more efficient in general, it 
doesn't tend to make things as drastically bad for as long as USB can.

There are some knobs you can twist for better interactivity, but I need 
to be up for work in a couple of hours, so I'll leave it to other 
posters to make suggestions in that regard for now.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Do different btrfs volumes compete for CPU?
  2017-03-31 11:49 ` Duncan
@ 2017-03-31 12:28   ` Marat Khalili
  2017-04-01  2:04     ` Duncan
  2017-04-01 10:17     ` Peter Grandi
  0 siblings, 2 replies; 9+ messages in thread
From: Marat Khalili @ 2017-03-31 12:28 UTC (permalink / raw)
  To: linux-btrfs

Thank you very much for the reply and suggestions; more comments below. 
Still, is there a definite answer to the root question: are different 
btrfs volumes independent in terms of CPU, or are there some shared 
workers that can be a point of contention?

> It would have been interesting if you had any reports from, for
> instance, htop during that time, showing the wait percentage on the
> various cores and the status (probably D, disk-wait) of the innocent
> process.  iotop output would of course have been even better, but it's
> also rather more special-case, so less commonly installed.
Curiously, I had iotop running but not htop.  [btrfs-transacti] showed 
some low-level activity in iotop (I still assume it was CPU-limited); 
the innocent process showed no activity anywhere.  Next time I'll also 
take note of the process state in ps (sadly, my omission).

> I believe you will find that the problem isn't btrfs, but rather, I/O
> contention
This possibility had not come to my mind.  Can USB drivers still be 
that bad in 4.4?  Is there any way to discriminate between these two 
situations (btrfs vs. USB load)?

BTW, the USB adapter used is this one (though the storage array only 
supports USB 3.0): https://www.asus.com/Motherboard-Accessory/USB_31_TYPEA_CARD/

> and that if you try the same thing with one of the
> filesystems being for instance ext4, you'll see the same problem there as
> well
Not sure if it's possible to reproduce the problem with ext4, since 
such extensive metadata operations can't be performed there, and simply 
moving large amounts of data has never created any problems for me 
regardless of filesystem.

--

With Best Regards,
Marat Khalili



* Re: Do different btrfs volumes compete for CPU?
  2017-03-31 12:28   ` Marat Khalili
@ 2017-04-01  2:04     ` Duncan
  2017-04-01 10:17     ` Peter Grandi
  1 sibling, 0 replies; 9+ messages in thread
From: Duncan @ 2017-04-01  2:04 UTC (permalink / raw)
  To: linux-btrfs

Marat Khalili posted on Fri, 31 Mar 2017 15:28:20 +0300 as excerpted:

>> and that if you try the same thing with one of the filesystems being
>> for instance ext4, you'll see the same problem there as well

> Not sure if it's possible to reproduce the problem with ext4, since
> such extensive metadata operations can't be performed there, and simply
> moving large amounts of data has never created any problems for me
> regardless of filesystem.

Try ext4 as the one hosting the innocent process...

And you said moving large amounts of data never triggered problems, but 
were you doing that over USB?

As for knobs I mentioned...

I'm not particularly sure about the knobs on USB, but...

For instance, on my old PCI-X (pre-PCIe) server board, the BIOS had a 
setting for the PCI transfer size.  Given that each transfer has an 
effectively fixed overhead and the bus itself has a maximum bandwidth, 
the tradeoff (reasonably common elsewhere as well) was between higher 
throughput at larger transfer sizes, due to lower per-transfer overhead 
but at the expense of interactivity, with other processes having to 
wait for each transfer to complete, and better interactivity and 
shorter waits on a full bus at smaller transfer sizes, at the expense 
of throughput due to higher per-transfer overhead.

I was having trouble with music cutouts and tried various Linux and 
ALSA settings to no avail, but once I set the BIOS to a much lower PCI 
transfer size, everything functioned much more smoothly: not just the 
music, but the mouse too, with less waiting on disk reads (because the 
writes were shorter), etc.

I /think/ the USB knobs are all in the kernel, but I believe there are 
similar transfer-size knobs there, if you know where to look.

Beyond that, there are more generic I/O knobs, listed below.  If it was 
CPU rather than I/O blocking, they might not help in this context, but 
they're worth knowing about anyway, particularly the dirty_* settings 
mentioned last.  (USB is much more CPU-intensive than most transfer 
buses, one reason Intel pushed it so hard as opposed to, say, FireWire, 
which offloads far more to the bus hardware and thus isn't as 
CPU-intensive.  So the USB knobs may well be worth investigating even 
if it was CPU.  I just wish I knew more about them.)

There's also the I/O scheduler.  CFQ has long been the default, but you 
might try deadline, and there's now multiqueue deadline (aka 
mq-deadline) as well.  noop is occasionally recommended for certain SSD 
use-cases, but it's not appropriate for spinning rust.  Of course most 
of the schedulers have detail knobs you can twist too, but I'm not 
sufficiently knowledgeable about those to say much about them.
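
The active scheduler can be inspected and switched at runtime via 
sysfs (a minimal sketch; the device names are whatever your system 
exposes, and sdX below is a placeholder):

```shell
# List available I/O schedulers per block device; the one shown in
# [brackets] is currently active.  Writing another listed name to the
# file switches the scheduler at runtime (root required).
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    printf '%s: %s\n' "${f#/sys/block/}" "$(cat "$f")"
done
# To switch, e.g. (as root):
#   echo deadline > /sys/block/sdX/queue/scheduler
```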

And 4.10 introduced the block-device writeback-throttling option 
(BLK_WBT), along with separate options underneath it for single-queue 
and multi-queue writeback throttling.  I turned those on here, but as 
most of my system is on fast SSD, I didn't notice (nor did I expect to 
notice) much difference.  In theory, however, it could make quite a 
difference with USB-based storage, particularly slow thumb drives and 
spinning rust.
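
On kernels built with BLK_WBT, the per-device latency target shows up 
in sysfs; it won't exist on the 4.4 kernel quoted above (a read-only 
sketch):

```shell
# Print the writeback-throttling latency target (in microseconds) for
# each block device that exposes it; 0 means throttling is disabled.
for f in /sys/block/*/queue/wbt_lat_usec; do
    [ -r "$f" ] || continue
    printf '%s: %s us\n' "${f#/sys/block/}" "$(cat "$f")"
done
```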

Last but certainly not least, as it can make quite a difference (and 
indeed did make a difference here back when I was on spinning rust), 
there's the dirty-data write caching, typically configured via the 
distro's sysctl mechanism but also settable manually via the 
/proc/sys/vm/dirty_* files.  The writeback-throttling features 
mentioned above may eventually reduce the need to tweak these, but 
until they're in commonly deployed kernels, tweaking these settings can 
make QUITE a big difference, because the percentage-of-RAM defaults 
were chosen back when 64 MB of RAM was big, and they simply aren't 
appropriate on modern systems that often have double-digit GiB of RAM.  
I'll skip the details here, as there are plenty of writeups on the web 
about tweaking these, as well as kernel documentation, but you may want 
to look into this if you haven't, because as I said it can make a HUGE 
difference in effective system interactivity.
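
The current values are trivial to read, and the *_bytes variants 
override the ratios when non-zero (a sketch; the example numbers are 
purely illustrative, not a recommendation):

```shell
# Current dirty-writeback settings.  The *_ratio values are percent of
# RAM; the *_bytes values, when non-zero, take precedence over them.
grep . /proc/sys/vm/dirty_ratio /proc/sys/vm/dirty_background_ratio \
       /proc/sys/vm/dirty_bytes /proc/sys/vm/dirty_background_bytes
# A more conservative setting for big-RAM machines might look like
# (as root; values illustrative only):
#   sysctl -w vm.dirty_background_bytes=$((64*1024*1024))
#   sysctl -w vm.dirty_bytes=$((256*1024*1024))
```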


That's what I know of.  I'd be a lot more comfortable if someone else 
had confirmed my original post, as I'm not a dev, just a btrfs user and 
list regular.  But I do know we've not had many reports of this sort of 
problem posted, and when we have in the past and it actually was 
separate btrfs filesystems, it turned out it was /not/ btrfs, so I'm 
/reasonably/ sure about it.  I also run multiple btrfs filesystems here 
and haven't seen the issue, but they're all on the same pair of 
partitioned, quite fast SSDs on SATA, so the comparison is admittedly 
of highly limited value.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Do different btrfs volumes compete for CPU?
  2017-03-31 12:28   ` Marat Khalili
  2017-04-01  2:04     ` Duncan
@ 2017-04-01 10:17     ` Peter Grandi
  2017-04-03  8:02       ` Marat Khalili
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Grandi @ 2017-04-01 10:17 UTC (permalink / raw)
  To: linux-btrfs

>> Approximately 16 hours ago I ran a script that deleted
>> more than ~100 snapshots and started a quota rescan on a large
>> USB-connected btrfs volume (5.4 of 22 TB occupied now).

That "USB-connected" is a rather bad idea. On the IRC channel
#Btrfs, whenever someone reports odd things happening, I ask "is
that USB?", and usually it is, and then we say "good luck!" :-).

The issues are:

* The USB mass-storage protocol is poorly designed, particularly
  its error handling.
* The underlying USB protocol is very CPU-intensive.
* Most importantly, nearly all USB chipsets, both system-side
  and peripheral-side, are breathtakingly buggy, but this does
  not get noticed with most USB devices.

>> Quota rescan only completed just now, with 100% load from
>> [btrfs-transacti] throughout this period,

> [ ... ] are different btrfs volumes independent in terms of
> CPU, or are there some shared workers that can be a point of
> contention?

As written that question is meaningless: despite the current
mania for "threads"/"threadlets" a filesystem driver is a
library, not a set of processes (all those '[btrfs-*]'
threadlets are somewhat misguided ways to do background
stuff).

The real problems here are:

* Qgroups are famously system-CPU-intensive, even if less so
  than in earlier releases, especially with subvolumes, so the
  16 hours of CPU is both absurd and expected. I think that
  qgroups are still effectively unusable.
* The scheduler gives excessive priority to kernel threads, so
  they can crowd out user processes. When for whatever reason
  the system CPU percentage rises everything else usually
  suffers.

> BTW, USB adapter used is this one (though storage array only
> supports USB 3.0):
> https://www.asus.com/Motherboard-Accessory/USB_31_TYPEA_CARD/

Only Intel/AMD USB chipsets and a few others are fairly
reliable, and for mass storage only with USB3 with UASP (USB
Attached SCSI), which is basically SATA-over-USB (more precisely,
the SCSI command set over USB). Your system-side card seems to
be recent enough to do UASP, but the peripheral-side chipset
probably isn't. Things are so bad with third-party chipsets that
even several types of add-on SATA and SAS cards are too buggy.
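
Whether the kernel bound the UAS driver or fell back to the old 
bulk-only usb-storage driver is visible in the USB device tree (a 
sketch; requires the usbutils package):

```shell
# "Driver=uas" means UASP is in use; "Driver=usb-storage" means the
# older bulk-only transport (no command queueing).
if command -v lsusb >/dev/null 2>&1; then
    lsusb -t 2>/dev/null | grep -E 'Driver=(uas|usb-storage)' \
        || echo "no USB mass-storage device found"
else
    echo "lsusb not installed (usbutils package)"
fi
```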


* Re: Do different btrfs volumes compete for CPU?
  2017-04-01 10:17     ` Peter Grandi
@ 2017-04-03  8:02       ` Marat Khalili
  2017-04-04 17:36         ` Peter Grandi
  0 siblings, 1 reply; 9+ messages in thread
From: Marat Khalili @ 2017-04-03  8:02 UTC (permalink / raw)
  To: Peter Grandi, linux-btrfs

On 01/04/17 13:17, Peter Grandi wrote:
> That "USB-connected" is a rather bad idea. On the IRC channel
> #Btrfs, whenever someone reports odd things happening, I ask "is
> that USB?", and usually it is, and then we say "good luck!" :-).
You're right, but USB/eSATA arrays are dirt cheap in comparison with 
similar-performance SAN/NAS solutions, which we unfortunately cannot 
really afford here.

Just a bit of back-story: I tried eSATA and ext4 first, but observed 
silent data corruption and irrecoverable kernel hangs; apparently, SATA 
is not really designed for external use. That's when I switched to both 
USB and, coincidentally, btrfs, and stability became orders of 
magnitude better even on a re-purposed consumer-grade PC (Z77 chipset, 
3rd-gen i5) with a horribly outdated kernel. Now I'm rebuilding the 
same configuration on server-grade hardware (C610 chipset, 40 
I/O-channel Xeon) and a modern kernel, and would thus be very surprised 
to find problems in USB throughput.

> As written that question is meaningless: despite the current
> mania for "threads"/"threadlets" a filesystem driver is a
> library, not a set of processes (all those '[btrfs-*]'
> threadlets are somewhat misguided ways to do background
> stuff).
But these threadlets, misguided as they are, do exist, don't they?

> * Qgroups are famously system-CPU-intensive, even if less so
>    than in earlier releases, especially with subvolumes, so the
>    16 hours of CPU is both absurd and expected. I think that
>    qgroups are still effectively unusable.
I understand that qgroups are very much a work in progress, but 
(correct me if I'm wrong) right now they're the only way to estimate 
the real usage of a subvolume and its snapshots. For instance, if I 
have a dozen 1 TB subvolumes, each having ~50 snapshots, and suddenly 
run out of space on a 24 TB volume, how do I find the culprit without 
qgroups? Keeping an eye on storage use is essential for any real-life 
use of snapshots, and they are too convenient as a backup 
de-duplication tool to give up.
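
For the record, those per-subvolume numbers come from `btrfs qgroup 
show` (a sketch; /mnt/vol is an illustrative mount point, and quotas 
must already be enabled on it):

```shell
# "rfer" is data referenced by the subvolume; "excl" is data exclusive
# to it (roughly, what deleting it would actually free).
if command -v btrfs >/dev/null 2>&1; then
    btrfs qgroup show -p /mnt/vol 2>/dev/null \
        || echo "quotas not enabled (or /mnt/vol is not btrfs)"
else
    echo "btrfs-progs not installed"
fi
```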

Just a stray thought: btrfs seems to lack an object type in between 
volume and subvolume that would keep track of storage use by several 
subvolumes plus their snapshots, allow snapshotting/transferring 
multiple subvolumes at once, etc. Some kind of super-subvolume 
(supervolume?) that is hierarchical. With the increasing use of 
subvolumes/snapshots within a single system installation, and multiple 
system installations (belonging to different users) in one volume due 
to the liberal use of LXC and similar technologies, this will become 
more and more of a pressing problem.

> * The scheduler gives excessive priority to kernel threads, so
>    they can crowd out user processes. When for whatever reason
>    the system CPU percentage rises everything else usually
>    suffers.
I thought it was clear, but it probably needs spelling out: while one 
core was completely occupied with the [btrfs-transacti] thread, five 
more were mostly idle, serving occasional network requests without any 
problems. And only a process that used storage intensively died. 
Fortunately or not, it's the only data point so far; smaller snapshot 
cullings do not cause problems.

> Only Intel/AMD USB chipsets and a few others are fairly
> reliable, and for mass storage only with USB3 with UASP (USB
> Attached SCSI), which is basically SATA-over-USB (more precisely,
> the SCSI command set over USB). Your system-side card seems to
> be recent enough to do UASP, but the peripheral-side chipset
> probably isn't. Things are so bad with third-party chipsets that
> even several types of add-on SATA and SAS cards are too buggy.
Thank you very much for this hint. The card is indeed the unknown 
factor here, and I'll keep a close eye on it. The chip is an ASM1142; 
not Intel/AMD, sadly, but quite popular nevertheless.

--

With Best Regards,
Marat Khalili



* Re: Do different btrfs volumes compete for CPU?
  2017-04-03  8:02       ` Marat Khalili
@ 2017-04-04 17:36         ` Peter Grandi
  2017-04-05  7:04           ` Marat Khalili
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Grandi @ 2017-04-04 17:36 UTC (permalink / raw)
  To: Linux fs Btrfs

> [ ... ] I tried to use eSATA and ext4 first, but observed
> silent data corruption and irrecoverable kernel hangs --
> apparently, SATA is not really designed for external use.

SATA works for external use and eSATA works well, but what really
matters is the chipset of the adapter card.

In my experience JMicron is not so good and Marvell a bit better;
best is to use a recent motherboard chipset with an internal
SATA-to-eSATA cable and bracket.

>> As written that question is meaningless: despite the current
>> mania for "threads"/"threadlets" a filesystem driver is a
>> library, not a set of processes (all those '[btrfs-*]'
>> threadlets are somewhat misguided ways to do background
>> stuff).

> But these threadlets, misguided as they are, do exist, don't
> they?

But that does not change the fact that it is a library and work
is initiated by user requests which are not per-subvolume, but
in effect per-volume.

> I understand that qgroups are very much a work in progress, but
> (correct me if I'm wrong) right now they're the only way to
> estimate the real usage of a subvolume and its snapshots.

It is one way to do so, and not a very good one. There is no
obviously good way to define "real usage" in the presence of
hard links and reflinking, and qgroups use just one way to
define it. A similar problem occurs with processes in the
presence of shared pages, multiply mapped shared libraries, etc.

> For instance, if I have a dozen 1 TB subvolumes, each having
> ~50 snapshots, and suddenly run out of space on a 24 TB volume,
> how do I find the culprit without qgroups?

It is not clear what "culprit" means here. The problem is that
both hard links and reflinking create really significant
ambiguities as to used space. Plus, the same problem would occur
with directories instead of subvolumes and hard links instead of
reflinked snapshots.

> [ ... ] The chip is ASM1142, not Intel/AMD sadly but quite
> popular nevertheless.

ASMedia USB3 chipsets are fairly reliable, at least the card
ones on the system side. The ones on the disk side I don't know
much about, though I have seen some ASMedia ones that also seem
OK. For the disks I use Seagate and WDC external boxes from
which I have removed the original disks, as I have noticed that
Seagate and WDC, for obvious reasons, tend to test and use the
more reliable chipsets. I have also got an external USB3 dock
with a recent ASMedia chipset that also seems good, but I
haven't used it much.


* Re: Do different btrfs volumes compete for CPU?
  2017-04-04 17:36         ` Peter Grandi
@ 2017-04-05  7:04           ` Marat Khalili
  2017-04-07  0:17             ` Martin
  0 siblings, 1 reply; 9+ messages in thread
From: Marat Khalili @ 2017-04-05  7:04 UTC (permalink / raw)
  To: Linux fs Btrfs

On 04/04/17 20:36, Peter Grandi wrote:
> SATA works for external use, eSATA works well, but what really
> matters is the chipset of the adapter card.
eSATA might be sound electrically, but mechanically it is awful. Run 
it for months in a crowded server room and inevitably you'll get 
disconnections and data corruption. I tried different cables and 
brackets, with the same result. If you've ever handled an eSATA 
connector, you know the feeling.

> In my experience JMicron is not so good, Marvell a bit better,
> best is to use a recent motherboard chipset with a SATA-eSATA
> internal cable and bracket.
That's exactly what I used to use: the internal controller of the Z77 
chipset plus bracket(s).

> But that does not change the fact that it is a library and work
> is initiated by user requests which are not per-subvolume, but
> in effect per-volume.
That's the answer I was looking for.

> It is a way to do so and not a very good way. There is no
> obviously good way to define "real usage" in the presence of
> hard-links and reflinking, and qgroups use just one way to
> define it. A similar problem happens with processes in the
> presence of shared pages, multiple mapped shared libraries etc.
No need to over-generalize. There's an obviously good way to define the 
"real usage" of a subvolume and its snapshots as long as it doesn't 
share any data with other subvolumes, as is often the case. If it does 
share, two figures (exclusive and referenced, as in qgroups) are 
sufficient for most tasks.

> The problem is that
> both hard-links and ref-linking create really significant
> ambiguities as to used space. Plus the same problem would happen
> with directories instead of subvolumes and hard-links instead of
> reflinked snapshots.
You're right, although with hard links there's at least a remote chance 
of estimating storage use with user-mode scripts.

> ASMedia USB3 chipsets are fairly reliable at the least the card
> ones on the system side. The ones on the disk side I don't know
> much about.
This is getting increasingly off-topic, but our mainstay is CFI 5-disk 
DAS boxes (8253JDGG, to be exact) filled with WD Reds in a RAID5 
configuration. They are no longer produced and are getting harder and 
harder to source, but have proven very reliable. According to lsusb 
they contain a JMicron JMS567 SATA 6Gb/s bridge.

--

With Best Regards,
Marat Khalili


* Re: Do different btrfs volumes compete for CPU?
  2017-04-05  7:04           ` Marat Khalili
@ 2017-04-07  0:17             ` Martin
  0 siblings, 0 replies; 9+ messages in thread
From: Martin @ 2017-04-07  0:17 UTC (permalink / raw)
  To: linux-btrfs

On 05/04/17 08:04, Marat Khalili wrote:
> On 04/04/17 20:36, Peter Grandi wrote:
>> SATA works for external use, eSATA works well, but what really
>> matters is the chipset of the adapter card.
> eSATA might be sound electrically, but mechanically it is awful. Run
> it for months in a crowded server room and inevitably you'll get
> disconnections and data corruption. I tried different cables and
> brackets, with the same result. If you've ever handled an eSATA
> connector, you know the feeling.

I've been using eSATA here for multiple disk packs, continuously 
connected for a few years now, for 48 TB of data (not enough room in 
the host for the disks).

Never suffered an eSATA disconnect.

Had the usual cooling-fan failures and HDD failures due to old age.


All just a case of ensuring undisturbed, clean cabling and a good 
UPS?...

(BTRFS spanning four disks per external pack has worked well also.)

Good luck,
Martin




end of thread, other threads:[~2017-04-07  0:18 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-31  7:05 Do different btrfs volumes compete for CPU? Marat Khalili
2017-03-31 11:49 ` Duncan
2017-03-31 12:28   ` Marat Khalili
2017-04-01  2:04     ` Duncan
2017-04-01 10:17     ` Peter Grandi
2017-04-03  8:02       ` Marat Khalili
2017-04-04 17:36         ` Peter Grandi
2017-04-05  7:04           ` Marat Khalili
2017-04-07  0:17             ` Martin
