Re: Predicting Process crash / Memory utilization using machine learning

From: "Valdis Klētnieks" <valdis.kletnieks@vt.edu>
To: prathamesh naik <prathamesh.naik20@gmail.com>
Cc: kernelnewbies@kernelnewbies.org
Subject: Re: Predicting Process crash / Memory utilization using machine learning
Date: Wed, 09 Oct 2019 17:28:26 -0400
Message-ID: <177218.1570656506@turing-police>
In-Reply-To: <CAGG2BF5MQ1ZZCOZgKE3qznRo=Ro5Df2Hus5pKas9JZbBa0+=Sw@mail.gmail.com>

On Wed, 09 Oct 2019 01:23:28 -0700, prathamesh naik said:
>             I want to work on project which can predict kernel process
> crash or even user space process crash (or memory usage spikes) using
> machine learning algorithms. 

This sounds like it's equivalent to the Halting Problem, and there are plenty
of other good reasons to think that predicting a process crash is, in general,
somewhere between "very difficult" and "impossible".

Even "memory usage spikes" are going to be a challenge.

Consider a program that's doing an in-memory sort. Your machine has 16 GB of
memory and 2 GB of swap.  It's known that the sort algorithm requires 1.5 GB of
memory for each gigabyte of input data.

Does the system start to thrash, or crash entirely, or does the sort complete
without issues?  There's no way to make a prediction without knowing the size
of the input data.  And if you're dealing with something like 

grep <regexp> file | predictable-memory-sort

where 'file' is a logfile *much* bigger than memory....

You can see where this is heading...
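
To put rough numbers on the sort example, here is a minimal sketch. It
assumes the sort is the only significant memory consumer on the box, which is
a simplification; the function name and the outcome labels are made up for
illustration.

```python
GIG = 1024 ** 3  # bytes in one "gig" for this sketch

def sort_outcome(input_bytes, ram_bytes=16 * GIG, swap_bytes=2 * GIG, overhead=1.5):
    """Classify the example sort purely from the size of its input."""
    needed = overhead * input_bytes  # 1.5 bytes of memory per byte of input
    if needed <= ram_bytes:
        return "fits in RAM"
    if needed <= ram_bytes + swap_bytes:
        return "spills into swap"
    return "exceeds RAM + swap"

# The outcome depends only on a number the monitor cannot see in advance.
for gigs in (8, 11, 13):
    print(gigs, "gig input ->", sort_outcome(gigs * GIG))
```

The same binary on the same machine lands in all three buckets; nothing
observable before the input arrives tells you which one.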

Bottom line:  I'm pretty convinced that in the general case, you can't do much
better than current monitoring systems already do: Look at free space, look at
the free space trendline for the past 5 minutes or whatever, and issue an alert
if the current trend indicates exhaustion in under 15 minutes.
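
For reference, the trendline check described above could be sketched roughly
like this. It is an illustration, not how any particular monitoring system
does it: the function names, the (minute, free_bytes) sample format, and the
choice of a least-squares line are all assumptions.

```python
def minutes_to_exhaustion(samples):
    """Fit a least-squares line to (minute, free_bytes) samples.

    Returns the minutes from the last sample until the fitted line reaches
    zero, or None if free space is not shrinking (or there is too little data).
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_f = sum(f for _, f in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (f - mean_f) for t, f in samples) / var_t
    if slope >= 0:
        return None  # flat or growing free space: no exhaustion predicted
    intercept = mean_f - slope * mean_t
    t_zero = -intercept / slope  # time at which the fitted line hits zero
    return max(0.0, t_zero - samples[-1][0])

def should_alert(samples, window_min=5, horizon_min=15):
    """Alert if the trend over the last window_min minutes hits zero within horizon_min."""
    if not samples:
        return False
    latest = samples[-1][0]
    recent = [(t, f) for t, f in samples if t >= latest - window_min]
    eta = minutes_to_exhaustion(recent)
    return eta is not None and eta < horizon_min

# Free space dropping by 100 units a minute from 1000: about 5 minutes left.
falling = [(t, 1000 - 100 * t) for t in range(6)]
print(should_alert(falling))
```

The 5 and 15 show up only as the two default arguments, which is what makes
them the obvious knobs to tune.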

Now, what *might* be interesting is seeing if machine learning across multiple
events is able to suggest better values than 5 and 15 minutes, to provide a
best tradeoff between issuing an alert early enough that a sysadmin can take
action, and avoiding issuing early alerts that turn out to be false alarms.
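
One way to frame that tradeoff, as a sketch only: score each candidate
(window, horizon) pair by replaying it over past events and charging for
false alarms and for alerts that came too late. Everything below is
hypothetical, including the cost weights, the 10-minute reaction time, and
the outcome records, which are invented purely to show the shape of the
calculation. It is also a plain exhaustive search, not machine learning; a
learned model would take the place of best_setting().

```python
from collections import namedtuple

# One record per past event, for one candidate setting:
#   alerted   - did the alert fire?
#   lead_min  - minutes between alert and exhaustion (None if not applicable)
#   exhausted - did memory actually run out?
Outcome = namedtuple("Outcome", "alerted lead_min exhausted")

def cost(outcomes, min_lead_min=10, false_alarm_cost=1.0, miss_cost=5.0):
    """Total penalty for one (window, horizon) setting over past events."""
    total = 0.0
    for o in outcomes:
        if o.exhausted:
            # No alert, or an alert too late for a sysadmin to react.
            if not o.alerted or o.lead_min < min_lead_min:
                total += miss_cost
        elif o.alerted:
            total += false_alarm_cost  # alert fired, memory never ran out
    return total

def best_setting(results):
    """Return the (window, horizon) key with the lowest total cost."""
    return min(results, key=lambda setting: cost(results[setting]))

# Invented outcomes for three candidate settings over the same three events.
results = {
    (5, 15): [Outcome(True, 12, True), Outcome(True, None, False), Outcome(False, None, False)],
    (2, 30): [Outcome(True, 25, True), Outcome(True, None, False), Outcome(True, None, False)],
    (10, 5): [Outcome(True, 3, True), Outcome(False, None, False), Outcome(False, None, False)],
}
print(best_setting(results))
```

How the weights are chosen decides the answer, so they would have to come
from the people who actually get paged.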

The problem there is that getting enough data on actual production systems
will be difficult, because sysadmins usually don't leave sub-optimal configuration
settings in place so you can gather data.

And data gathered for machine learning on an intentionally misconfigured test
system won't be applicable to other machines.

Good luck, this problem is a lot harder than it looks....
