linux-nvdimm.lists.01.org archive mirror
 help / color / mirror / Atom feed
* [ndctl PATCH] monitor: Add epoll timeout for forcing a full dimm health check
@ 2020-02-21 17:54 Vaibhav Jain
  2020-03-31 22:15 ` Dan Williams
  0 siblings, 1 reply; 3+ messages in thread
From: Vaibhav Jain @ 2020-02-21 17:54 UTC (permalink / raw)
  To: linux-nvdimm; +Cc: Vaibhav Jain, Aneesh Kumar K . V

This patch adds a new command argument to the 'monitor' command namely
'--check-interval' that triggers a call to notify_dimm_event() at
regular intervals forcing a periodic check of dimm smart events.

This behavior is useful for dimms that do not support event notifications
in case the health status of an nvdimm changes. This is especially
true in case of PAPR-SCM dimms as the PHYP hyper-visor doesn't provide
any notifications to the guest kernel on a change in nvdimm health
status. In such case periodic polling of the is the only way to track
the health of a nvdimm.

The patch updates monitor_event() adding a timeout value to
epoll_wait() call. Also to prevent the possibility of a single dimm
generating enough events thereby preventing check of health status of
other nvdimms, a 'fullpoll_ts' time-stamp is added to keep track of
when full health check of all dimms happened. If after epoll_wait()
returns 'fullpoll_ts' time-stamp indicates last full dimm health check
happened beyond 'check-interval' seconds then a full dimm health check
is enforced.

Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
---
 Documentation/ndctl/ndctl-monitor.txt |  4 ++++
 ndctl/monitor.c                       | 31 ++++++++++++++++++++++++---
 2 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/Documentation/ndctl/ndctl-monitor.txt b/Documentation/ndctl/ndctl-monitor.txt
index 2239f047266d..14cc59d57157 100644
--- a/Documentation/ndctl/ndctl-monitor.txt
+++ b/Documentation/ndctl/ndctl-monitor.txt
@@ -108,6 +108,10 @@ will not work if "--daemon" is specified.
 The monitor will attempt to enable the alarm control bits for all
 specified events.
 
+-i::
+--check-interval=::
+	Force a recheck of dimm health every <n> seconds.
+
 -u::
 --human::
 	Output monitor notification as human friendly json format instead
diff --git a/ndctl/monitor.c b/ndctl/monitor.c
index 1755b87a5eeb..b72c5852524e 100644
--- a/ndctl/monitor.c
+++ b/ndctl/monitor.c
@@ -4,6 +4,7 @@
 #include <stdio.h>
 #include <json-c/json.h>
 #include <libgen.h>
+#include <time.h>
 #include <dirent.h>
 #include <util/json.h>
 #include <util/filter.h>
@@ -33,6 +34,7 @@ static struct monitor {
 	bool daemon;
 	bool human;
 	bool verbose;
+	unsigned int poll_timeout;
 	unsigned int event_flags;
 	struct log_ctx ctx;
 } monitor;
@@ -322,9 +324,14 @@ static int monitor_event(struct ndctl_ctx *ctx,
 		struct monitor_filter_arg *mfa)
 {
 	struct epoll_event ev, *events;
-	int nfds, epollfd, i, rc = 0;
+	int nfds, epollfd, i, rc = 0, polltimeout = -1;
 	struct monitor_dimm *mdimm;
 	char buf;
+	/* last time a full poll happened */
+	struct timespec fullpoll_ts, ts;
+
+	if (monitor.poll_timeout)
+		polltimeout = monitor.poll_timeout * 1000;
 
 	events = calloc(mfa->num_dimm, sizeof(struct epoll_event));
 	if (!events) {
@@ -354,14 +361,30 @@ static int monitor_event(struct ndctl_ctx *ctx,
 		}
 	}
 
+	clock_gettime(CLOCK_BOOTTIME, &fullpoll_ts);
 	while (1) {
 		did_fail = 0;
-		nfds = epoll_wait(epollfd, events, mfa->num_dimm, -1);
-		if (nfds <= 0 && errno != EINTR) {
+		nfds = epoll_wait(epollfd, events, mfa->num_dimm, polltimeout);
+		if (nfds < 0 && errno != EINTR) {
 			err(&monitor, "epoll_wait error: (%s)\n", strerror(errno));
 			rc = -errno;
 			goto out;
 		}
+
+		/* If needed force a full poll of dimm health */
+		clock_gettime(CLOCK_BOOTTIME, &ts);
+		if ((fullpoll_ts.tv_sec - ts.tv_sec) > monitor.poll_timeout) {
+			nfds = 0;
+			dbg(&monitor, "forcing a full poll\n");
+		}
+
+		/* If we timed out then fill events array with all dimms */
+		if (nfds == 0) {
+			list_for_each(&mfa->dimms, mdimm, list)
+				events[nfds++].data.ptr = mdimm;
+			fullpoll_ts = ts;
+		}
+
 		for (i = 0; i < nfds; i++) {
 			mdimm = events[i].data.ptr;
 			if (util_dimm_event_filter(mdimm, monitor.event_flags)) {
@@ -570,6 +593,8 @@ int cmd_monitor(int argc, const char **argv, struct ndctl_ctx *ctx)
 				"use human friendly output formats"),
 		OPT_BOOLEAN('v', "verbose", &monitor.verbose,
 				"emit extra debug messages to log"),
+		OPT_UINTEGER('i', "check-interval", &monitor.poll_timeout,
+			     "force a dimm health recheck every <n> seconds"),
 		OPT_END(),
 	};
 	const char * const u[] = {
-- 
2.24.1
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [ndctl PATCH] monitor: Add epoll timeout for forcing a full dimm health check
  2020-02-21 17:54 [ndctl PATCH] monitor: Add epoll timeout for forcing a full dimm health check Vaibhav Jain
@ 2020-03-31 22:15 ` Dan Williams
  2020-04-01  5:36   ` Vaibhav Jain
  0 siblings, 1 reply; 3+ messages in thread
From: Dan Williams @ 2020-03-31 22:15 UTC (permalink / raw)
  To: Vaibhav Jain; +Cc: linux-nvdimm, Aneesh Kumar K . V

On Fri, Feb 21, 2020 at 9:55 AM Vaibhav Jain <vaibhav@linux.ibm.com> wrote:
>
> This patch adds a new command argument to the 'monitor' command namely
> '--check-interval' that triggers a call to notify_dimm_event() at
> regular intervals forcing a periodic check of dimm smart events.
>
> This behavior is useful for dimms that do not support event notifications
> in case the health status of an nvdimm changes. This is especially
> true in case of PAPR-SCM dimms as the PHYP hyper-visor doesn't provide
> any notifications to the guest kernel on a change in nvdimm health
> status. In such case periodic polling of the is the only way to track
> the health of a nvdimm.
>
> The patch updates monitor_event() adding a timeout value to
> epoll_wait() call. Also to prevent the possibility of a single dimm
> generating enough events thereby preventing check of health status of
> other nvdimms, a 'fullpoll_ts' time-stamp is added to keep track of
> when full health check of all dimms happened. If after epoll_wait()
> returns 'fullpoll_ts' time-stamp indicates last full dimm health check
> happened beyond 'check-interval' seconds then a full dimm health check
> is enforced.
>
> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> ---
>  Documentation/ndctl/ndctl-monitor.txt |  4 ++++
>  ndctl/monitor.c                       | 31 ++++++++++++++++++++++++---
>  2 files changed, 32 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/ndctl/ndctl-monitor.txt b/Documentation/ndctl/ndctl-monitor.txt
> index 2239f047266d..14cc59d57157 100644
> --- a/Documentation/ndctl/ndctl-monitor.txt
> +++ b/Documentation/ndctl/ndctl-monitor.txt
> @@ -108,6 +108,10 @@ will not work if "--daemon" is specified.
>  The monitor will attempt to enable the alarm control bits for all
>  specified events.
>
> +-i::
> +--check-interval=::
> +       Force a recheck of dimm health every <n> seconds.

I'd just call this "-p --poll="  to put the monitor into polled mode.

> +
>  -u::
>  --human::
>         Output monitor notification as human friendly json format instead
> diff --git a/ndctl/monitor.c b/ndctl/monitor.c
> index 1755b87a5eeb..b72c5852524e 100644
> --- a/ndctl/monitor.c
> +++ b/ndctl/monitor.c
> @@ -4,6 +4,7 @@
>  #include <stdio.h>
>  #include <json-c/json.h>
>  #include <libgen.h>
> +#include <time.h>
>  #include <dirent.h>
>  #include <util/json.h>
>  #include <util/filter.h>
> @@ -33,6 +34,7 @@ static struct monitor {
>         bool daemon;
>         bool human;
>         bool verbose;
> +       unsigned int poll_timeout;
>         unsigned int event_flags;
>         struct log_ctx ctx;
>  } monitor;
> @@ -322,9 +324,14 @@ static int monitor_event(struct ndctl_ctx *ctx,
>                 struct monitor_filter_arg *mfa)
>  {
>         struct epoll_event ev, *events;
> -       int nfds, epollfd, i, rc = 0;
> +       int nfds, epollfd, i, rc = 0, polltimeout = -1;
>         struct monitor_dimm *mdimm;
>         char buf;
> +       /* last time a full poll happened */
> +       struct timespec fullpoll_ts, ts;
> +
> +       if (monitor.poll_timeout)
> +               polltimeout = monitor.poll_timeout * 1000;
>
>         events = calloc(mfa->num_dimm, sizeof(struct epoll_event));
>         if (!events) {
> @@ -354,14 +361,30 @@ static int monitor_event(struct ndctl_ctx *ctx,
>                 }
>         }
>
> +       clock_gettime(CLOCK_BOOTTIME, &fullpoll_ts);
>         while (1) {
>                 did_fail = 0;
> -               nfds = epoll_wait(epollfd, events, mfa->num_dimm, -1);
> -               if (nfds <= 0 && errno != EINTR) {
> +               nfds = epoll_wait(epollfd, events, mfa->num_dimm, polltimeout);
> +               if (nfds < 0 && errno != EINTR) {
>                         err(&monitor, "epoll_wait error: (%s)\n", strerror(errno));
>                         rc = -errno;
>                         goto out;
>                 }
> +
> +               /* If needed force a full poll of dimm health */
> +               clock_gettime(CLOCK_BOOTTIME, &ts);
> +               if ((fullpoll_ts.tv_sec - ts.tv_sec) > monitor.poll_timeout) {
> +                       nfds = 0;
> +                       dbg(&monitor, "forcing a full poll\n");
> +               }

Why is this hunk above needed? If the DIMMs are firing events then all
of them that fired will get serviced and on any timeout all DIMMs will
get serviced. Why does fullpoll_ts need to be tracked?

> +
> +               /* If we timed out then fill events array with all dimms */
> +               if (nfds == 0) {
> +                       list_for_each(&mfa->dimms, mdimm, list)
> +                               events[nfds++].data.ptr = mdimm;
> +                       fullpoll_ts = ts;
> +               }
> +
>                 for (i = 0; i < nfds; i++) {
>                         mdimm = events[i].data.ptr;
>                         if (util_dimm_event_filter(mdimm, monitor.event_flags)) {
> @@ -570,6 +593,8 @@ int cmd_monitor(int argc, const char **argv, struct ndctl_ctx *ctx)
>                                 "use human friendly output formats"),
>                 OPT_BOOLEAN('v', "verbose", &monitor.verbose,
>                                 "emit extra debug messages to log"),
> +               OPT_UINTEGER('i', "check-interval", &monitor.poll_timeout,
> +                            "force a dimm health recheck every <n> seconds"),

I'd replace "force a dimm health check" with "poll and report status".

The monitor can theoretically also register for region and bus events,
so --poll option just forces the monitor to report the  all recorded
events not necessarily only dimm events.
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [ndctl PATCH] monitor: Add epoll timeout for forcing a full dimm health check
  2020-03-31 22:15 ` Dan Williams
@ 2020-04-01  5:36   ` Vaibhav Jain
  0 siblings, 0 replies; 3+ messages in thread
From: Vaibhav Jain @ 2020-04-01  5:36 UTC (permalink / raw)
  To: Dan Williams, Vaibhav Jain; +Cc: linux-nvdimm, Aneesh Kumar K . V


Thanks for reviewing this patch,

Dan Williams <dan.j.williams@intel.com> writes:

> On Fri, Feb 21, 2020 at 9:55 AM Vaibhav Jain <vaibhav@linux.ibm.com> wrote:
>>
>> This patch adds a new command argument to the 'monitor' command namely
>> '--check-interval' that triggers a call to notify_dimm_event() at
>> regular intervals forcing a periodic check of dimm smart events.
>>
>> This behavior is useful for dimms that do not support event notifications
>> in case the health status of an nvdimm changes. This is especially
>> true in case of PAPR-SCM dimms as the PHYP hyper-visor doesn't provide
>> any notifications to the guest kernel on a change in nvdimm health
>> status. In such case periodic polling of the is the only way to track
>> the health of a nvdimm.
>>
>> The patch updates monitor_event() adding a timeout value to
>> epoll_wait() call. Also to prevent the possibility of a single dimm
>> generating enough events thereby preventing check of health status of
>> other nvdimms, a 'fullpoll_ts' time-stamp is added to keep track of
>> when full health check of all dimms happened. If after epoll_wait()
>> returns 'fullpoll_ts' time-stamp indicates last full dimm health check
>> happened beyond 'check-interval' seconds then a full dimm health check
>> is enforced.
>>
>> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
>> ---
>>  Documentation/ndctl/ndctl-monitor.txt |  4 ++++
>>  ndctl/monitor.c                       | 31 ++++++++++++++++++++++++---
>>  2 files changed, 32 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/ndctl/ndctl-monitor.txt b/Documentation/ndctl/ndctl-monitor.txt
>> index 2239f047266d..14cc59d57157 100644
>> --- a/Documentation/ndctl/ndctl-monitor.txt
>> +++ b/Documentation/ndctl/ndctl-monitor.txt
>> @@ -108,6 +108,10 @@ will not work if "--daemon" is specified.
>>  The monitor will attempt to enable the alarm control bits for all
>>  specified events.
>>
>> +-i::
>> +--check-interval=::
>> +       Force a recheck of dimm health every <n> seconds.
>
> I'd just call this "-p --poll="  to put the monitor into polled mode.
Agree to the suggestion. Will update the flag name in V2. 

>
>> +
>>  -u::
>>  --human::
>>         Output monitor notification as human friendly json format instead
>> diff --git a/ndctl/monitor.c b/ndctl/monitor.c
>> index 1755b87a5eeb..b72c5852524e 100644
>> --- a/ndctl/monitor.c
>> +++ b/ndctl/monitor.c
>> @@ -4,6 +4,7 @@
>>  #include <stdio.h>
>>  #include <json-c/json.h>
>>  #include <libgen.h>
>> +#include <time.h>
>>  #include <dirent.h>
>>  #include <util/json.h>
>>  #include <util/filter.h>
>> @@ -33,6 +34,7 @@ static struct monitor {
>>         bool daemon;
>>         bool human;
>>         bool verbose;
>> +       unsigned int poll_timeout;
>>         unsigned int event_flags;
>>         struct log_ctx ctx;
>>  } monitor;
>> @@ -322,9 +324,14 @@ static int monitor_event(struct ndctl_ctx *ctx,
>>                 struct monitor_filter_arg *mfa)
>>  {
>>         struct epoll_event ev, *events;
>> -       int nfds, epollfd, i, rc = 0;
>> +       int nfds, epollfd, i, rc = 0, polltimeout = -1;
>>         struct monitor_dimm *mdimm;
>>         char buf;
>> +       /* last time a full poll happened */
>> +       struct timespec fullpoll_ts, ts;
>> +
>> +       if (monitor.poll_timeout)
>> +               polltimeout = monitor.poll_timeout * 1000;
>>
>>         events = calloc(mfa->num_dimm, sizeof(struct epoll_event));
>>         if (!events) {
>> @@ -354,14 +361,30 @@ static int monitor_event(struct ndctl_ctx *ctx,
>>                 }
>>         }
>>
>> +       clock_gettime(CLOCK_BOOTTIME, &fullpoll_ts);
>>         while (1) {
>>                 did_fail = 0;
>> -               nfds = epoll_wait(epollfd, events, mfa->num_dimm, -1);
>> -               if (nfds <= 0 && errno != EINTR) {
>> +               nfds = epoll_wait(epollfd, events, mfa->num_dimm, polltimeout);
>> +               if (nfds < 0 && errno != EINTR) {
>>                         err(&monitor, "epoll_wait error: (%s)\n", strerror(errno));
>>                         rc = -errno;
>>                         goto out;
>>                 }
>> +
>> +               /* If needed force a full poll of dimm health */
>> +               clock_gettime(CLOCK_BOOTTIME, &ts);
>> +               if ((fullpoll_ts.tv_sec - ts.tv_sec) > monitor.poll_timeout) {
>> +                       nfds = 0;
>> +                       dbg(&monitor, "forcing a full poll\n");
>> +               }
>
> Why is this hunk above needed? If the DIMMs are firing events then all
> of them that fired will get serviced and on any timeout all DIMMs will
> get serviced. Why does fullpoll_ts need to be tracked?
>
This hunk was added to prevent any dimm thats firing too many events
from starving the poll of events from other dimms.

>> +
>> +               /* If we timed out then fill events array with all dimms */
>> +               if (nfds == 0) {
>> +                       list_for_each(&mfa->dimms, mdimm, list)
>> +                               events[nfds++].data.ptr = mdimm;
>> +                       fullpoll_ts = ts;
>> +               }
>> +
>>                 for (i = 0; i < nfds; i++) {
>>                         mdimm = events[i].data.ptr;
>>                         if (util_dimm_event_filter(mdimm, monitor.event_flags)) {
>> @@ -570,6 +593,8 @@ int cmd_monitor(int argc, const char **argv, struct ndctl_ctx *ctx)
>>                                 "use human friendly output formats"),
>>                 OPT_BOOLEAN('v', "verbose", &monitor.verbose,
>>                                 "emit extra debug messages to log"),
>> +               OPT_UINTEGER('i', "check-interval", &monitor.poll_timeout,
>> +                            "force a dimm health recheck every <n> seconds"),
>
> I'd replace "force a dimm health check" with "poll and report status".
Fair point. Will update the this usage text in v2.
>
> The monitor can theoretically also register for region and bus events,
> so --poll option just forces the monitor to report the  all recorded
> events not necessarily only dimm events.
Agree

-- 
Vaibhav Jain <vaibhav@linux.ibm.com>
Linux Technology Center, IBM India Pvt. Ltd.
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-04-01  5:36 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-21 17:54 [ndctl PATCH] monitor: Add epoll timeout for forcing a full dimm health check Vaibhav Jain
2020-03-31 22:15 ` Dan Williams
2020-04-01  5:36   ` Vaibhav Jain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).