This might be a simpler way to control the number of udev worker threads running at the same time.
On large machines (CPU-wise, memory-wise, and disk-wise), I have only seen LVM time out when udev's children setting is left at the default. The default seems to be set wrong: it appears tuned for a case where a large number of the machine's disks are going to time out (or otherwise be really, really slow), so a huge number of worker threads is required to cover them. With the default on a close-to-100-core machine, udev accumulated about 87 minutes of CPU time during boot (an elapsed window of about 2 minutes). Setting the number of children to 4 cut that to around 2-3 minutes of CPU time in the same window, and actually produced a much faster and much more reliable boot (no timeouts). We hit these timeouts on a number of the larger machines (70 cores or more) before we debugged it and determined what was going on. It appears that the udev workers on giant machines with a lot of disks overwhelm each other in some sort of tight loop (either process-creation system time or some other resource constraint), causing contention and doing very little useful work.
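For reference, here is a sketch of the usual places the worker cap can be lowered. Option names vary by udev/systemd version, so treat these as assumptions to verify against your distribution's documentation rather than exact settings:

```shell
# Runtime change (takes effect immediately, not persistent across reboots):
udevadm control --children-max=4

# Persistent, via the kernel command line (read by systemd-udevd at boot):
#   udev.children_max=4

# Persistent, via /etc/udev/udev.conf on versions that support it:
#   children_max=4
```

The runtime `udevadm control` form is handy for experimenting with different values before committing one to the boot configuration.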
Just an observation, and this may have nothing to do with what you have going on, but what you are describing sounds very close to what I debugged. We ran "ps axuwwS | grep -i udev" just after boot to see how much CPU time udev had accumulated during boot, and found that as we lowered the number of children, that time dropped, the boot got faster, and the timeouts stopped. Since udev was racking up 90 minutes of CPU time in an elapsed window of around 120 seconds, it had to be running a significant number of workers during boot. I believe these same udev workers are what invoke the pvscans.
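To turn that ps output into a single number, the TIME column (field 10 in standard `ps aux`-style output, MM:SS) can be summed across the udev workers. A sketch, using fabricated sample lines in place of real `ps axuwwS | grep -i udev` output:

```shell
#!/bin/sh
# Sum the TIME column (field 10, MM:SS) across matching ps lines.
sum_udev_time() {
  awk '{ split($10, t, ":"); total += t[1] * 60 + t[2] }
       END { printf "%d seconds\n", total }'
}

# Hypothetical sample lines standing in for real ps output:
printf '%s\n' \
  'root 612 3.0 0.0 47000 3900 ? S 10:01 43:10 /usr/lib/systemd/systemd-udevd' \
  'root 613 2.9 0.0 47000 3800 ? S 10:01 44:05 /usr/lib/systemd/systemd-udevd' \
  | sum_udev_time
# -> 5235 seconds
```

Comparing that total across boots with different children settings makes the effect easy to see.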
Below is one case, but I know there are several other similar cases for other distributions. Note the default number of workers = 8 + number_of_cpus * 64, which is going to be a disaster: it results in either one worker per disk/LUN being started at the same time, or the maximum number of workers, and either of those produces a high degree of unproductive system contention on a machine with a significant number of LUNs.
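To make the scaling concrete, here is what that formula produces at a few core counts (just evaluating the expression from the post, not any particular distribution's actual code):

```shell
#!/bin/sh
# workers = 8 + number_of_cpus * 64, per the default described above.
for cpus in 4 16 70 96; do
  echo "$cpus cpus -> $((8 + cpus * 64)) max udev workers"
done
# 4 cpus   ->  264 max udev workers
# 70 cpus  -> 4488 max udev workers
# 96 cpus  -> 6152 max udev workers
```

Several thousand workers on a 70+ core machine is far more concurrency than the storage can usefully absorb, which matches the contention behavior described above.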