From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.6 required=3.0 tests=BAYES_00,DKIM_INVALID, DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,HTML_MESSAGE,INCLUDES_CR_TRAILER, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9C794C4338F for ; Tue, 17 Aug 2021 04:52:08 +0000 (UTC) Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id DEBAB60E78 for ; Tue, 17 Aug 2021 04:52:07 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org DEBAB60E78 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=nongnu.org Received: from localhost ([::1]:38700 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1mFr54-0007Yl-P5 for qemu-devel@archiver.kernel.org; Tue, 17 Aug 2021 00:52:06 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:35060) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1mFr4A-0006ne-TF for qemu-devel@nongnu.org; Tue, 17 Aug 2021 00:51:11 -0400 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:39075) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1mFr45-0002Xo-7k for qemu-devel@nongnu.org; Tue, 17 Aug 2021 00:51:09 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1629175863; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=yTTiGItEpODMeQqAMmTo2/rj8tEQ/JPwpssft3/CoxA=; b=IW2BhWsdxJzNohd5Wyz8gsvh2gqZERInG1IHRasNSsKMi7VNPpw0mHSelvZfyV3JbHwMiG OBmxXYRzuy/vjhEd+V8x6A/1U040tINl9SyJ0DA/ewUnaE28asBIqOLSomLZ/10g8L307g N5wRfbDXGSekqZbL0QlLwhGzcLbDt4w= Received: from mail-oo1-f70.google.com (mail-oo1-f70.google.com [209.85.161.70]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-22-aI8hIh0JPjOFtsjRqJHIiw-1; Tue, 17 Aug 2021 00:50:58 -0400 X-MC-Unique: aI8hIh0JPjOFtsjRqJHIiw-1 Received: by mail-oo1-f70.google.com with SMTP id w6-20020a4adec6000000b0028b7d13a4c8so2823225oou.13 for ; Mon, 16 Aug 2021 21:50:57 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=yTTiGItEpODMeQqAMmTo2/rj8tEQ/JPwpssft3/CoxA=; b=kM54QFa87WFaf30UaCW2iJUetdymbU/tiGhF40/To3ij/3MY4uCiQpQJL0es0t8UxT P4cZtAiclfri34ZVqO79QE4Aa4X/oN+4A52GQO5rXL4P2NZDUFUBOd263yUsAmDGxVO0 F1LwwZrn3X5UbgPPKU5Qc1esyLrc3B3JRLPuKx44+253NaYw/mMOqcYEnMk7VJByJpkG 3Qs3GMD8PbIvBBpyR+ur70chl9fg/43du6Z2XeqoLIZ2luwv2OMJLfNh0O8b3aEv/YsJ BorbboZtoXjlPAzDsuzi0cAeSygwlYz1RNhGzC81F3xiGGbgP0F9HCJNIlsezQybNUyr EUVw== X-Gm-Message-State: AOAM5318GkBVmy/vpajcl/qE6Ae5zJGbslbCJgb8afPAUNCVfU6wHbkK isHbHJCS2F+s+E4bzcCOrkKKnyp75GLFMNAtRKWzcsHTzdjzh7nBGC2xeFlpdW5VELiUbfAM2O/ KrOYqTDrq6dzFYsw1ojW+PqGk7L14Bx8= X-Received: by 2002:a05:6830:1604:: with SMTP id g4mr1193688otr.45.1629175857360; Mon, 16 Aug 2021 21:50:57 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwytrWFZkIHoFQexaE128K5OpX0kCWZ59Q6kR4BL8zNBAWRLQac/0DOr8i4genOpFlrPTikIA4udN/LjXWIE68= X-Received: by 2002:a05:6830:1604:: with SMTP id g4mr1193671otr.45.1629175857147; Mon, 16 Aug 2021 21:50:57 -0700 (PDT) MIME-Version: 1.0 References: <20210730201846.5147-1-niteesh.gs@gmail.com> <20210730201846.5147-10-niteesh.gs@gmail.com> In-Reply-To: <20210730201846.5147-10-niteesh.gs@gmail.com> From: John Snow Date: Tue, 17 Aug 2021 00:50:46 -0400 Message-ID: Subject: Re: [PATCH v3 09/13] python/aqmp-tui: Add QMP connection manager To: G S Niteesh Babu Authentication-Results: relay.mimecast.com; auth=pass smtp.auth=CUSA124A263 smtp.mailfrom=jsnow@redhat.com X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: multipart/alternative; boundary="0000000000003e046505c9ba109f" Received-SPF: pass client-ip=170.10.133.124; envelope-from=jsnow@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -34 X-Spam_score: -3.5 X-Spam_bar: --- X-Spam_report: (-3.5 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.698, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Eduardo Habkost , Kashyap Chamarthy , Markus Armbruster , Wainer Moschetta , qemu-devel , Stefan Hajnoczi , Cleber Rosa , Eric Blake Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" --0000000000003e046505c9ba109f Content-Type: text/plain; charset="UTF-8" On Fri, Jul 30, 2021 at 4:19 PM G S Niteesh Babu wrote: > Instead of manually connecting and disconnecting from the > server. We now rely on the runstate to manage the QMP > connection. > > Along with this the ability to reconnect on certain exceptions > has also been added. > > Signed-off-by: G S Niteesh Babu > --- > python/qemu/aqmp/aqmp_tui.py | 109 ++++++++++++++++++++++++++++++----- > 1 file changed, 94 insertions(+), 15 deletions(-) > > diff --git a/python/qemu/aqmp/aqmp_tui.py b/python/qemu/aqmp/aqmp_tui.py > index 0d5ec62cb7..ef91883fa5 100644 > --- a/python/qemu/aqmp/aqmp_tui.py > +++ b/python/qemu/aqmp/aqmp_tui.py > @@ -25,8 +25,9 @@ > import urwid_readline > > from ..qmp import QEMUMonitorProtocol, QMPBadPortError > +from .error import ProtocolError > from .message import DeserializationError, Message, UnexpectedTypeError > -from .protocol import ConnectError > +from .protocol import ConnectError, Runstate > from .qmp_client import ExecInterruptedError, QMPClient > from .util import create_task, pretty_traceback > > @@ -67,12 +68,24 @@ def format_json(msg: str) -> str: > return ' '.join(words) > > > +def type_name(mtype: Any) -> str: > + """ > + Returns the type name > + """ > + return type(mtype).__name__ > This is a lot of lines for something that doesn't do very much -- do we really need it? > + > + > class App(QMPClient): > - def __init__(self, address: Union[str, Tuple[str, int]]) -> None: > + def __init__(self, address: Union[str, Tuple[str, int]], num_retries: > int, > + retry_delay: Optional[int]) -> None: > urwid.register_signal(type(self), UPDATE_MSG) > self.window = Window(self) > self.address = address > self.aloop: Optional[Any] = None # FIXME: Use more concrete type. > + self.num_retries = num_retries > + self.retry_delay = retry_delay > + self.retry: bool = False > + self.disconnecting: bool = False > Why is this one needed again ? ... > super().__init__() > > def add_to_history(self, msg: str, level: Optional[str] = None) -> > None: > @@ -119,7 +132,7 @@ def _cb_inbound(self, msg: Message) -> Message: > LOGGER.info('Error server disconnected before reply') > urwid.emit_signal(self, UPDATE_MSG, > '{"error": "Server disconnected before > reply"}') > - self._set_status("Server disconnected") > + await self.disconnect() > except Exception as err: > LOGGER.error('Exception from _send_to_server: %s', str(err)) > raise err > @@ -136,15 +149,29 @@ def kill_app(self) -> None: > create_task(self._kill_app()) > > async def _kill_app(self) -> None: > - # It is ok to call disconnect even in disconnect state > + await self.disconnect() > + LOGGER.debug('Disconnect finished. Exiting app') > + raise urwid.ExitMainLoop() > + > + async def disconnect(self) -> None: > + if self.disconnecting: > + return > try: > - await self.disconnect() > - LOGGER.debug('Disconnect finished. Exiting app') > + self.disconnecting = True > + await super().disconnect() > + self.retry = True > + except EOFError as err: > + LOGGER.info('disconnect: %s', type_name(err)) > + self.retry = True > + except ProtocolError as err: > + LOGGER.info('disconnect: %s', type_name(err)) > + self.retry = False > except Exception as err: > - LOGGER.info('_kill_app: %s', str(err)) > - # Let the app crash after providing a proper stack trace > + LOGGER.error('disconnect: Unhandled exception %s', str(err)) > + self.retry = False > raise err > - raise urwid.ExitMainLoop() > + finally: > + self.disconnecting = False > > def handle_event(self, event: Message) -> None: > # FIXME: Consider all states present in qapi/run-state.json > @@ -161,14 +188,61 @@ def _get_formatted_address(self) -> str: > addr = f'{host}:{port}' > return addr > > - async def connect_server(self) -> None: > + async def _retry_connection(self) -> Optional[str]: > + current_retries = 0 > + err = None > + # Increase in power sequence of 2 if no delay is provided > + cur_delay = 1 > + inc_delay = 2 > + if self.retry_delay: > + inc_delay = 1 > + cur_delay = self.retry_delay > + # initial try > + await self.connect_server() > + while self.retry and current_retries < self.num_retries: > + LOGGER.info('Connection Failed, retrying in %d', cur_delay) > + status = f'[Retry #{current_retries} ({cur_delay}s)]' > + self._set_status(status) > + > + await asyncio.sleep(cur_delay) > + > + err = await self.connect_server() > + cur_delay *= inc_delay > + # Cap delay to 5mins > + cur_delay = min(cur_delay, 5 * 60) > + current_retries += 1 > + # If all retries failed report the last error > + LOGGER.info('All retries failed: %s', str(err)) > + return type_name(err) > I had suggested something like an exponential backoff, but maybe a constant delay would be a little cleaner to implement for right now without getting too fancy over it. If you go with a simpler retry algorithm, do you think you could clean up the logic in the retry loop here a bit more? Something like: for _ in range(num_retries): try: whatever_you_have_to_do_to_connect() return except ConnectError as err: LOGGER.info(...etc) await asyncio.sleep(whatever_the_delay_is) # ran out of retries here, presumably the connection manager will just go idle until the user interferes some other way. In particular, I think passing around the name of the exception is a little dubious -- we should be logging with the actual Exception we've received. > + > + async def manage_connection(self) -> None: > + while True: > + if self.runstate == Runstate.IDLE: > + LOGGER.info('Trying to reconnect') > But will this be true upon the very first boot? This message might not be right. > + err = await self._retry_connection() > This seems named oddly too, since it might be the initial attempt and not necessarily a reconnection or a retry. > + # If retry is still true then, we have exhausted all our > tries. > + if self.retry: > + self._set_status(f'Error: {err}') > + else: > + addr = self._get_formatted_address() > + self._set_status(f'[Connected {addr}]') > + elif self.runstate == Runstate.DISCONNECTING: > + self._set_status('[Disconnected]') > + await self.disconnect() > + # check if a retry is needed > Is this required? I would have hoped that after calling disconnect that the state would have again changed to IDLE and you wouldn't need this clause here. > + if self.runstate == Runstate.IDLE: > + continue > + await self.runstate_changed() > + > + async def connect_server(self) -> Optional[str]: > try: > await self.connect(self.address) > - addr = self._get_formatted_address() > - self._set_status(f'Connected to {addr}') > + self.retry = False > except ConnectError as err: > LOGGER.info('connect_server: ConnectError %s', str(err)) > - self._set_status('Server shutdown') > + self.retry = True > + return type_name(err) > + return None > > def run(self, debug: bool = False) -> None: > screen = urwid.raw_display.Screen() > @@ -191,7 +265,7 @@ def run(self, debug: bool = False) -> None: > event_loop=event_loop) > > create_task(self.wait_for_events(), self.aloop) > - create_task(self.connect_server(), self.aloop) > + create_task(self.manage_connection(), self.aloop) > try: > main_loop.run() > except Exception as err: > @@ -333,6 +407,11 @@ def main() -> None: > parser = argparse.ArgumentParser(description='AQMP TUI') > parser.add_argument('qmp_server', help='Address of the QMP server' > '< UNIX socket path | TCP addr:port >') > + parser.add_argument('--num-retries', type=int, default=10, > + help='Number of times to reconnect before giving > up') > + parser.add_argument('--retry-delay', type=int, > + help='Time(s) to wait before next retry.' > + 'Default action is to increase delay in powers of > 2') > parser.add_argument('--log-file', help='The Log file name') > parser.add_argument('--log-level', default='WARNING', > help='Log level > ') > @@ -348,7 +427,7 @@ def main() -> None: > except QMPBadPortError as err: > parser.error(str(err)) > > - app = App(address) > + app = App(address, args.num_retries, args.retry_delay) > > if args.log_file: > LOGGER.addHandler(logging.FileHandler(args.log_file)) > -- > 2.17.1 > > Right idea overall - possibly needs some polish and to be integrated with an earlier patch to avoid the intermediate FIXMEs. Thanks, --js --0000000000003e046505c9ba109f Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


=
On Fri, Jul 30, 2021 at 4:19 PM G S N= iteesh Babu <niteesh.gs@gmail.co= m> wrote:
Instead of manually connecting and disconnecting from the
server. We now rely on the runstate to manage the QMP
connection.

Along with this the ability to reconnect on certain exceptions
has also been added.

Signed-off-by: G S Niteesh Babu <niteesh.gs@gmail.com>
---
=C2=A0python/qemu/aqmp/aqmp_tui.py | 109 ++++++++++++++++++++++++++++++----= -
=C2=A01 file changed, 94 insertions(+), 15 deletions(-)

diff --git a/python/qemu/aqmp/aqmp_tui.py b/python/qemu/aqmp/aqmp_tui.py index 0d5ec62cb7..ef91883fa5 100644
--- a/python/qemu/aqmp/aqmp_tui.py
+++ b/python/qemu/aqmp/aqmp_tui.py
@@ -25,8 +25,9 @@
=C2=A0import urwid_readline

=C2=A0from ..qmp import QEMUMonitorProtocol, QMPBadPortError
+from .error import ProtocolError
=C2=A0from .message import DeserializationError, Message, UnexpectedTypeErr= or
-from .protocol import ConnectError
+from .protocol import ConnectError, Runstate
=C2=A0from .qmp_client import ExecInterruptedError, QMPClient
=C2=A0from .util import create_task, pretty_traceback

@@ -67,12 +68,24 @@ def format_json(msg: str) -> str:
=C2=A0 =C2=A0 =C2=A0return ' '.join(words)


+def type_name(mtype: Any) -> str:
+=C2=A0 =C2=A0 """
+=C2=A0 =C2=A0 Returns the type name
+=C2=A0 =C2=A0 """
+=C2=A0 =C2=A0 return type(mtype).__name__

<= div>This is a lot of lines for something that doesn't do very much -- d= o we really need it?
=C2=A0
+
+
=C2=A0class App(QMPClient):
-=C2=A0 =C2=A0 def __init__(self, address: Union[str, Tuple[str, int]]) -&g= t; None:
+=C2=A0 =C2=A0 def __init__(self, address: Union[str, Tuple[str, int]], num= _retries: int,
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0retry_delay:= Optional[int]) -> None:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0urwid.register_signal(type(self), UPDATE_= MSG)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0self.window =3D Window(self)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0self.address =3D address
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0self.aloop: Optional[Any] =3D None=C2=A0 = # FIXME: Use more concrete type.
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 self.num_retries =3D num_retries
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 self.retry_delay =3D retry_delay
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 self.retry: bool =3D False
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 self.disconnecting: bool =3D False

Why is this one needed again ? ...
= =C2=A0
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0super().__init__()

=C2=A0 =C2=A0 =C2=A0def add_to_history(self, msg: str, level: Optional[str]= =3D None) -> None:
@@ -119,7 +132,7 @@ def _cb_inbound(self, msg: Message) -> Message:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0LOGGER.info('Error serv= er disconnected before reply')
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0urwid.emit_signal(self, UPD= ATE_MSG,
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0'{"error": "Server= disconnected before reply"}')
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 self._set_status("Server di= sconnected")
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 await self.disconnect()
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0except Exception as err:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0LOGGER.error('Exception= from _send_to_server: %s', str(err))
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0raise err
@@ -136,15 +149,29 @@ def kill_app(self) -> None:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0create_task(self._kill_app())

=C2=A0 =C2=A0 =C2=A0async def _kill_app(self) -> None:
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 # It is ok to call disconnect even in disconne= ct state
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 await self.disconnect()
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 LOGGER.debug('Disconnect finished. Exiting= app')
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 raise urwid.ExitMainLoop()
+
+=C2=A0 =C2=A0 async def disconnect(self) -> None:
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 if self.disconnecting:
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 return
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0try:
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 await self.disconnect()
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 LOGGER.debug('Disconnect fin= ished. Exiting app')
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 self.disconnecting =3D True
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 await super().disconnect()
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 self.retry =3D True
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 except EOFError as err:
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 LOGGER.info('disconnect: %s&= #39;, type_name(err))
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 self.retry =3D True
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 except ProtocolError as err:
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 LOGGER.info('disconnect: %s&= #39;, type_name(err))
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 self.retry =3D False
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0except Exception as err:
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 LOGGER.info('_kill_app: %s&#= 39;, str(err))
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 # Let the app crash after provid= ing a proper stack trace
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 LOGGER.error('disconnect: Un= handled exception %s', str(err))
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 self.retry =3D False
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0raise err
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 raise urwid.ExitMainLoop()
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 finally:
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 self.disconnecting =3D False

=C2=A0 =C2=A0 =C2=A0def handle_event(self, event: Message) -> None:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0# FIXME: Consider all states present in q= api/run-state.json
@@ -161,14 +188,61 @@ def _get_formatted_address(self) -> str:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0addr =3D f'{host}:{port= }'
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0return addr

-=C2=A0 =C2=A0 async def connect_server(self) -> None:
+=C2=A0 =C2=A0 async def _retry_connection(self) -> Optional[str]:
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 current_retries =3D 0
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 err =3D None
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 # Increase in power sequence of 2 if no delay = is provided
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 cur_delay =3D 1
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 inc_delay =3D 2
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 if self.retry_delay:
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 inc_delay =3D 1
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 cur_delay =3D self.retry_delay +=C2=A0 =C2=A0 =C2=A0 =C2=A0 # initial try
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 await self.connect_server()
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 while self.retry and current_retries < self= .num_retries:
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 LOGGER.info('Connection Fail= ed, retrying in %d', cur_delay)
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 status =3D f'[Retry #{curren= t_retries} ({cur_delay}s)]'
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 self._set_status(status)
+
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 await asyncio.sleep(cur_delay) +
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 err =3D await self.connect_serve= r()
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 cur_delay *=3D inc_delay
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 # Cap delay to 5mins
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 cur_delay =3D min(cur_delay, 5 *= 60)
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 current_retries +=3D 1
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 # If all retries failed report the last error<= br> +=C2=A0 =C2=A0 =C2=A0 =C2=A0 LOGGER.info('All retries failed: %s', = str(err))
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 return type_name(err)
I had suggested something like an exponential backoff, but mayb= e a constant delay would be a little cleaner to implement for right now wit= hout getting too fancy over it. If you go with a simpler retry algorithm, d= o you think you could clean up the logic in the retry loop here a bit more?=

Something like:

for _ in range(num_retries):
=C2= =A0=C2=A0=C2=A0 try:
=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 whatever_you_have_to_do_to_connect()
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 return
=C2=A0=C2=A0=C2=A0 except ConnectError as err:<= /div>
=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 = LOGGER.info(...etc)
=C2=A0=C2=A0=C2=A0 awai= t asyncio.sleep(whatever_the_delay_is)
# ra= n out of retries here, presumably the connection manager will just go idle = until the user interferes some other way.

In particular, I think passing arou= nd the name of the exception is a little dubious -- we should be logging wi= th the actual Exception we've received.
=C2=A0
+
+=C2=A0 =C2=A0 async def manage_connection(self) -> None:
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 while True:
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 if self.runstate =3D=3D Runstate= .IDLE:
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 LOGGER.info('T= rying to reconnect')

But will this = be true upon the very first boot? This message might not be right.
=C2=A0
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 err =3D await self= ._retry_connection()

This seems named o= ddly too, since it might be the initial attempt and not necessarily a recon= nection or a retry.
=C2=A0
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 # If retry is stil= l true then, we have exhausted all our tries.
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 if self.retry:
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 self= ._set_status(f'Error: {err}')
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 else:
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 addr= =3D self._get_formatted_address()
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 self= ._set_status(f'[Connected {addr}]')
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 elif self.runstate =3D=3D Runsta= te.DISCONNECTING:
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 self._set_status(&= #39;[Disconnected]')
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 await self.disconn= ect()
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 # check if a retry= is needed

Is this required? I would ha= ve hoped that after calling disconnect that the state would have again chan= ged to IDLE and you wouldn't need this clause here.
=C2= =A0
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 if self.runstate = =3D=3D Runstate.IDLE:
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 cont= inue
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 await self.runstate_changed() +
+=C2=A0 =C2=A0 async def connect_server(self) -> Optional[str]:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0try:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0await self.connect(self.add= ress)
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 addr =3D self._get_formatted_add= ress()
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 self._set_status(f'Connected= to {addr}')
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 self.retry =3D False
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0except ConnectError as err:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0LOGGER.info('connect_se= rver: ConnectError %s', str(err))
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 self._set_status('Server shu= tdown')
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 self.retry =3D True
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 return type_name(err)
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 return None

=C2=A0 =C2=A0 =C2=A0def run(self, debug: bool =3D False) -> None:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0screen =3D urwid.raw_display.Screen()
@@ -191,7 +265,7 @@ def run(self, debug: bool =3D False) -> None:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 event_loop=3Devent_loo= p)

=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0create_task(self.wait_for_events(), self.= aloop)
-=C2=A0 =C2=A0 =C2=A0 =C2=A0 create_task(self.connect_server(), self.aloop)=
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 create_task(self.manage_connection(), self.alo= op)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0try:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0main_loop.run()
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0except Exception as err:
@@ -333,6 +407,11 @@ def main() -> None:
=C2=A0 =C2=A0 =C2=A0parser =3D argparse.ArgumentParser(description=3D'A= QMP TUI')
=C2=A0 =C2=A0 =C2=A0parser.add_argument('qmp_server', help=3D'A= ddress of the QMP server'
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0'< UNIX socket path | TCP addr:port >')
+=C2=A0 =C2=A0 parser.add_argument('--num-retries', type=3Dint, def= ault=3D10,
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 help=3D'Number of times to reconnect before giving up')<= br> +=C2=A0 =C2=A0 parser.add_argument('--retry-delay', type=3Dint,
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 help=3D'Time(s) to wait before next retry.'
+=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 'Default action is to increase delay in powers of 2') =C2=A0 =C2=A0 =C2=A0parser.add_argument('--log-file', help=3D'T= he Log file name')
=C2=A0 =C2=A0 =C2=A0parser.add_argument('--log-level', default=3D&#= 39;WARNING',
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0help=3D'Log level <CRITICAL|ERROR|WARNING|INFO|DEBU= G|>')
@@ -348,7 +427,7 @@ def main() -> None:
=C2=A0 =C2=A0 =C2=A0except QMPBadPortError as err:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0parser.error(str(err))

-=C2=A0 =C2=A0 app =3D App(address)
+=C2=A0 =C2=A0 app =3D App(address, args.num_retries, args.retry_delay)

=C2=A0 =C2=A0 =C2=A0if args.log_file:
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0LOGGER.addHandler(logging.FileHandler(arg= s.log_file))
--
2.17.1


Right idea overall - possibly needs so= me polish and to be integrated with an earlier patch to avoid the intermedi= ate FIXMEs.

Thanks,
--js
=
--0000000000003e046505c9ba109f--