From: Roger Heflin <rogerheflin@gmail.com>
To: LVM general discussion and development <linux-lvm@redhat.com>
Cc: Martin Wilck <martin.wilck@suse.com>,
	David Teigland <teigland@redhat.com>,
	Zdenek Kabelac <zkabelac@redhat.com>
Subject: Re: [linux-lvm] Discussion: performance issue on event activation mode
Date: Sun, 6 Jun 2021 11:35:28 -0500	[thread overview]
Message-ID: <CAAMCDef6sSy5X2+-WLR3wx5m7dx9bhmN3EKWyPa31SebS=wRGQ@mail.gmail.com> (raw)
In-Reply-To: <a885bfa7-9635-ba73-c16b-13b9ef7a2aa4@suse.com>


This might be a simpler way to control the number of threads running at the
same time.

On large machines (CPU-, memory-, and disk-wise), I have only seen LVM time
out when udev's number of children is left at its default. The default seems
to be set wrong; it appears to be tuned for the case where a large number of
the disks on the machine are going to time out (or otherwise be really,
really slow), so that a huge number of threads is needed to cope. I found
that with the default on a machine with close to 100 cores, udev accumulated
about 87 minutes of CPU time during boot (an elapsed time of about 2
minutes). Changing the number of children to 4 resulted in udev using around
2-3 minutes in the same window, and actually gave a much faster and much more
reliable boot (no timeouts). We hit these timeouts on a number of the larger
machines (70 cores or more) before we debugged it and worked out what was
going on. It appears that on the giant machines with a lot of disks, the udev
threads overwhelm each other in some sort of tight loop (either
process-creation system time or some other resource constraint), causing
contention while doing very little useful work.

Just an observation: this may have nothing to do with what you have going
on, but what you are describing sounds very close to what I debugged. We ran
"ps axuwwS | grep -i udev" just after boot to determine how much CPU time
udev had used during boot, and found that as we lowered the number of
children the time went down, boot got faster, and the timeouts stopped. And
since udev was getting 90 minutes of CPU time in an elapsed time of around
120 seconds, it had to be using a significant number of threads during boot.
I believe these same udev threads are what call the pvscans.
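
For reference, this is roughly how we took that measurement; the systemctl
line is just an optional cross-check and assumes CPU accounting is enabled
for the udev service:

```
# Cumulative CPU time of udev, including time from exited worker children (S):
ps axuwwS | grep -i udev

# Optional cross-check via systemd CPU accounting (assumes CPUAccounting=yes):
systemctl show systemd-udevd.service -p CPUUsageNSec
```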

Below is one case, but I know there are several other similar cases for
other distributions. Note that the default number of workers is 8 +
number_of_cpus * 64, which is going to be a disaster: it results in either
one thread per disk/LUN being started at the same time, or the maximum
number of workers, either of which produces a high degree of unproductive
system contention on a machine with a significant number of LUNs.
https://www.suse.com/support/kb/doc/?id=000019156
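
If anyone wants to try the same workaround, here is a rough sketch of where
the knob lives; the value 4 is just what worked for us, and the option
spelling varies a bit between udev/systemd versions, so treat this as a
starting point rather than a recipe:

```
# Persistent: cap the number of udev worker processes via udev.conf
# (children_max= is a udev.conf option; the name may differ on older releases).
echo 'children_max=4' >> /etc/udev/udev.conf

# Or just for one boot, on the kernel command line (initrd included):
#   udev.children_max=4 rd.udev.children_max=4
```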

On Sun, Jun 6, 2021 at 1:16 AM heming.zhao@suse.com <heming.zhao@suse.com>
wrote:

> Hello David & Zdenek,
>
> I'm sending this mail about a well-known performance issue:
>   when a system has a huge number of devices attached (e.g. 1000+ disks),
>   lvm2-pvscan@.service takes too much time, systemd very easily times out,
>   and the boot ends up in the emergency shell.
>
> This performance topic has been discussed here several times, and the issue
> has been around for many years. Even with the latest lvm2 code, it still
> can't be fixed completely. The latest code adds a new function
> _pvscan_aa_quick(), which reduces the booting time a lot but still doesn't
> fix the issue entirely.
>
> In my test env (x86 qemu-kvm machine, 6 vCPUs, 22GB mem, 1015 PVs/VGs/LVs),
> comparing with and without the _pvscan_aa_quick() code, booting time drops
> from "9min 51s" to "2min 6s". But after switching to direct activation, the
> booting time is 8.7s (for the longest lvm2 service:
> lvm2-activation-early.service).
>
> The hot spot of event activation is dev_cache_scan, whose time complexity is
> O(n^2). At the same time, the systemd-udev workers generate/run
> lvm2-pvscan@.service for every detected disk, so the overall complexity is
> O(n^3).
>
> ```
> dev_cache_scan //order: O(n^2)
>   + _insert_dirs //O(n)
>   | if obtain_device_list_from_udev() true
>   |   _insert_udev_dir //O(n)
>   |
>   + dev_cache_index_devs //O(n)
>
> There are 'n' lvm2-pvscan@.service running: O(n)
> Overall: O(n) * O(n^2) => O(n^3)
> ```
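>
> (Just as an aid to see the 'n' in practice, after a boot you can count how
> many pvscan service instances systemd created; this is only an observation
> helper, not part of any fix:)
>
> ```
> systemctl list-units --all --no-legend 'lvm2-pvscan@*' | wc -l
> ```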
>
> Question/topic:
> Could we find a final solution that has good performance & scales well under
> event-based activation?
>
> Here are two possible solutions (which Martin & I discussed):
>
> 1. During the boot phase, lvm2 automatically switches to direct activation
> mode ("event_activation = 0"). After boot, it switches back to event
> activation mode.
>
> The booting phase is a special stage. *During boot*, we could "pretend" that
> direct activation (event_activation=0) is set, and rely on
> lvm2-activation-*.service for PV detection. Once
> lvm2-activation-net.service has finished, we could "switch on" event
> activation.
>
> More precisely: pvscan --cache would look at some file under /run,
> e.g. /run/lvm2/boot-finished, and quit immediately if the file doesn't exist
> (as if event_activation=0 were set). In lvm2-activation-net.service, we
> would add something like:
>
> ```
> ExecStartPost=/bin/touch /run/lvm2/boot-finished
> ```
>
> ... so that, from this point in time onward, "pvscan --cache" would _not_
> quit immediately any more, but run normally (assuming that the global
> event_activation setting is 1). This way we'd get the benefit of using the
> static activation services during boot (good performance) while still being
> able to react to udev events after booting has finished.
>
> This idea could be implemented with very few code changes.
> The result would be a huge step forward in booting time.
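>
> As an illustration only, the intended pvscan behaviour could be sketched as
> a tiny shell wrapper (the real change would live inside pvscan itself; the
> path /run/lvm2/boot-finished is the one proposed above):
>
> ```
> # Hypothetical wrapper showing the proposed early exit during boot:
> if [ ! -e /run/lvm2/boot-finished ]; then
>     exit 0   # behave as if event_activation=0 until boot has finished
> fi
> exec /usr/sbin/pvscan --cache "$@"
> ```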
>
>
> 2. Change the lvm2-pvscan@.service running mode from parallel to serial.
>
> This idea looks a little weird, since it goes against the trend of today's
> programming technologies: parallel programming on multi-core machines.
>
> Idea:
> The way lvm2 scans "/dev" is hard to change, but the surrounding
> lvm2-pvscan@.service instances could change from running in parallel to
> running serially.
>
> For example, a running pvscan instance could set a "running" flag in tmpfs
> (i.e. /run/lvm/) indicating that no other pvscan process should run in
> parallel. If another pvscan is invoked and sees "running", it would create a
> "pending" flag and quit. Any other pvscan process seeing the "pending" flag
> would also just quit. If the first instance sees the "pending" flag, it
> would atomically remove "pending" and restart itself, in order to catch any
> device that might have appeared since the previous sysfs scan.
> In most conditions the devices have already been found by the first pvscan
> run, so the next pvscan run should work in O(n), because the target devices
> have already been inserted into the internal cache tree. And overall, only
> a single pvscan process would be running at any given time.
>
> We could then keep a list of pending to-be-scanned devices (perhaps as
> directory entries in some tmpfs directory). On exit, pvscan could check this
> directory and restart if it's non-empty. A sketch of the flag protocol
> follows below.
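>
> Here is a minimal sketch of that flag protocol, in shell just to make the
> idea concrete (the names under /run/lvm are assumptions, not an existing
> lvm2 interface; mkdir serves as the atomic "running" flag):
>
> ```
> RUN=/run/lvm
> if ! mkdir "$RUN/pvscan-running" 2>/dev/null; then
>     touch "$RUN/pvscan-pending"   # another pvscan is scanning: note it, quit
>     exit 0
> fi
> while :; do
>     rm -f "$RUN/pvscan-pending"       # anything newer will set it again
>     pvscan --cache                    # the actual (re)scan
>     [ -e "$RUN/pvscan-pending" ] || break   # rescan if events arrived meanwhile
> done
> rmdir "$RUN/pvscan-running"
> ```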
>
>
> Thanks
> Heming
>
> _______________________________________________
> linux-lvm mailing list
> linux-lvm@redhat.com
> https://listman.redhat.com/mailman/listinfo/linux-lvm
> read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
>
>
