On 17.05.21 17:55, Julien Grall wrote: > Hi Juergen, > > On 17/05/2021 07:47, Juergen Gross wrote: >> On 14.05.21 22:19, Julien Grall wrote: >>> Hi Juergen, >>> >>> On 14/05/2021 09:41, Juergen Gross wrote: >>>> Xenstored is absolutely mandatory for a Xen host and it can't be >>>> restarted, so being killed by OOM-killer in case of memory shortage is >>>> to be avoided. >>>> >>>> Set /proc/$pid/oom_score_adj (if available) to -500 in order to allow >>>> xenstored to use large amounts of memory without being killed. >>>> >>>> In order to support large numbers of domains the limit for open file >>>> descriptors might need to be raised. Each domain needs 2 file >>>> descriptors (one for the event channel and one for the xl per-domain >>>> daemon to connect to xenstored). >>> >>> Hmmm... AFAICT there is only one file descriptor to handle all the >>> event channels. Could you point out the code showing one event FD per >>> domain? >> >> I have let me fooled by just counting the file descriptors used with one >> or two domain active. >> >> So you are right that all event channels only use one fd, but each xl >> daemon will use two (which should be fixed, IMO). And thinking more >> about it it is even worse: each qemu process will at least require one >> additional fd. >> >>> >>>> >>>> Try to raise ulimit for open files to 65536. First the hard limit if >>>> needed, and then the soft limit. >>> >>> I am not sure this is right to impose this limit to everyone. For >>> instance, one admin may know that there will be no more than 100 >>> domains on its system. >> >> Is setting a higher limit really a problem? > > I am quite unease to set a limit that nearly nobody will reach unless > something went horribly wrong on the system. Hmm, I really don't see the downside of having the possibility to let xenstored open lots of files. Anyway we can do it via prlimit if you like that better. > >> >>> So the admin should be able to configure them. At this point, I think >>> the two limit should be set my the initscript rather than xenstored >>> itself. >> >> But the admin would need to know the Xen internals for selecting the >> correct limits. In the end I'd be fine with moving this modification to >> the script starting Xenstore (which would be launch-xenstore), but the >> configuration item should be "max number of domains to support". > > I would be fine with "max numer of domains to support". What I care the > most here is the limits are actually applied most of (if not all) the time. I did another test and found: - the xl daemon for a guest will use 2 socket connections - qemu for a HVM guest will use 3 socket connections - qemu for a PV guest is using one socket connection - 14 other files are used by xenstored So we should set the limit to 5 * n_dom + 100 (some headroom will be nice IMO). > >> >>> >>> This would also avoid the problem where Xenstored is not allowed to >>> modify its limit (see more below). >>> >>>> >>>> Signed-off-by: Juergen Gross >>>> --- >>>>   tools/xenstore/xenstored_core.c   |  2 ++ >>>>   tools/xenstore/xenstored_core.h   |  3 ++ >>>>   tools/xenstore/xenstored_minios.c |  4 +++ >>>>   tools/xenstore/xenstored_posix.c  | 46 >>>> +++++++++++++++++++++++++++++++ >>>>   4 files changed, 55 insertions(+) >>>> >>>> diff --git a/tools/xenstore/xenstored_core.c >>>> b/tools/xenstore/xenstored_core.c >>>> index b66d119a98..964e693450 100644 >>>> --- a/tools/xenstore/xenstored_core.c >>>> +++ b/tools/xenstore/xenstored_core.c >>>> @@ -2243,6 +2243,8 @@ int main(int argc, char *argv[]) >>>>           xprintf = trace; >>>>   #endif >>>> +    claim_resources(); >>>> + >>>>       signal(SIGHUP, trigger_reopen_log); >>>>       if (tracefile) >>>>           tracefile = talloc_strdup(NULL, tracefile); >>>> diff --git a/tools/xenstore/xenstored_core.h >>>> b/tools/xenstore/xenstored_core.h >>>> index 1467270476..ac26973648 100644 >>>> --- a/tools/xenstore/xenstored_core.h >>>> +++ b/tools/xenstore/xenstored_core.h >>>> @@ -255,6 +255,9 @@ void daemonize(void); >>>>   /* Close stdin/stdout/stderr to complete daemonize */ >>>>   void finish_daemonize(void); >>>> +/* Set OOM-killer score and raise ulimit. */ >>>> +void claim_resources(void); >>>> + >>>>   /* Open a pipe for signal handling */ >>>>   void init_pipe(int reopen_log_pipe[2]); >>>> diff --git a/tools/xenstore/xenstored_minios.c >>>> b/tools/xenstore/xenstored_minios.c >>>> index c94493e52a..df8ff580b0 100644 >>>> --- a/tools/xenstore/xenstored_minios.c >>>> +++ b/tools/xenstore/xenstored_minios.c >>>> @@ -32,6 +32,10 @@ void finish_daemonize(void) >>>>   { >>>>   } >>>> +void claim_resources(void) >>>> +{ >>>> +} >>>> + >>>>   void init_pipe(int reopen_log_pipe[2]) >>>>   { >>>>       reopen_log_pipe[0] = -1; >>>> diff --git a/tools/xenstore/xenstored_posix.c >>>> b/tools/xenstore/xenstored_posix.c >>>> index 48c37ffe3e..0074fbd8b2 100644 >>>> --- a/tools/xenstore/xenstored_posix.c >>>> +++ b/tools/xenstore/xenstored_posix.c >>>> @@ -22,6 +22,7 @@ >>>>   #include >>>>   #include >>>>   #include >>>> +#include >>>>   #include "utils.h" >>>>   #include "xenstored_core.h" >>>> @@ -87,6 +88,51 @@ void finish_daemonize(void) >>>>       close(devnull); >>>>   } >>>> +static void avoid_oom_killer(void) >>>> +{ >>>> +    char path[32]; >>>> +    char val[] = "-500"; >>>> +    int fd; >>>> + >>>> +    snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", >>>> (int)getpid()); >>> >>> This looks Linux specific. How about other OSes? >> >> I don't know whether other OSes have an OOM killer, and if they do, how >> to configure it. It is a best effort attempt, after all. > > I have CCed Roger who should be able to help for FreeBSD. It would be possible to set the OOM-score from the launch script, too. > >> >>> >>>> + >>>> +    fd = open(path, O_WRONLY); >>>> +    /* Do nothing if file doesn't exist. */ >>> >>> Your commit message leads to think that we *must* configure the OOM. >>> If not, then we should not continue. But here, this suggest this is >>> optional. In fact... >> >> I can modify the commit message by adding a "Try to". >> >>> >>>> +    if (fd < 0) >>>> +        return; >>>> +    /* Ignore errors. */ >>>> +    write(fd, val, sizeof(val)); >>> >>> ... xenstored may not be allowed to modify its own parameters. So >>> this would continue silently without the admin necessarily knowning >>> the limit wasn't applied. >> >> I can add a line in the Xenstore log in this regard. > > This feels wrong to me. If a limit cannot be applied then it should fail > early rather than possibly at the wrong moment a few days (months?) after. I think issuing a warning would be better here. We are running with no OOM adjustments since years now. Juergen