From: "Ananyev, Konstantin"
Subject: Re: [PATCH v2 3/4] eal: add synchronous multi-process communication
Date: Wed, 17 Jan 2018 17:20:38 +0000
Message-ID: <2601191342CEEE43887BDE71AB9772588627F12B@irsmsx105.ger.corp.intel.com>
References: <1512067450-59203-1-git-send-email-jianfeng.tan@intel.com>
 <1515643654-129489-1-git-send-email-jianfeng.tan@intel.com>
 <1515643654-129489-4-git-send-email-jianfeng.tan@intel.com>
 <2601191342CEEE43887BDE71AB9772588627E0E5@irsmsx105.ger.corp.intel.com>
 <5733adcf-ef47-2c07-a39c-7eda01add6e0@intel.com>
 <2601191342CEEE43887BDE71AB9772588627E2D8@irsmsx105.ger.corp.intel.com>
 <2601191342CEEE43887BDE71AB9772588627EE16@irsmsx105.ger.corp.intel.com>
 <74ccd840-86af-4dba-e5ba-494017052841@intel.com>
In-Reply-To: <74ccd840-86af-4dba-e5ba-494017052841@intel.com>
To: "Tan, Jianfeng", "dev@dpdk.org", "Burakov, Anatoly"
Cc: "Richardson, Bruce", "thomas@monjalon.net"

>
>
> On 1/17/2018 6:50 PM, Ananyev, Konstantin wrote:
> >
> >>> Hi Jianfeng,
> >>>
> >>>> -----Original Message-----
> >>>> From: Tan, Jianfeng
> >>>> Sent: Tuesday, January 16, 2018 8:11 AM
> >>>> To: Ananyev, Konstantin; dev@dpdk.org; Burakov, Anatoly
> >>>> Cc: Richardson, Bruce; thomas@monjalon.net
> >>>> Subject: Re: [PATCH v2 3/4] eal: add synchronous multi-process communication
> >>>>
> >>>> First of all, thank you, Konstantin and Anatoly. The other comments are well
> >>>> received and I'll send out a new version.
> >>>>
> >>>>
> >>>> On 1/16/2018 8:00 AM, Ananyev, Konstantin wrote:
> >>>>>> We need a synchronous way for multi-process communication,
> >>>>>> i.e., blocking to wait for the reply message after we send a request
> >>>>>> to the peer process.
> >>>>>>
> >>>>>> We add two APIs, rte_eal_mp_request() and rte_eal_mp_reply(), for
> >>>>>> this use case. By invoking rte_eal_mp_request(), a request message
> >>>>>> is sent out, and the caller then waits there for a reply message. The
> >>>>>> timeout is hard-coded at 5 sec. The reply message is copied into the
> >>>>>> parameters of this API so that the caller can decide how to interpret
> >>>>>> that information (including params and fds). Note that if a primary
> >>>>>> process owns multiple secondary processes, this API will fail.
> >>>>>>
> >>>>>> The API rte_eal_mp_reply() is always called by an mp action handler.
> >>>>>> Here we add another parameter for rte_eal_mp_t so that the action
> >>>>>> handler knows which peer address to reply to.
> >>>>>>
> >>>>>> We use a mutex in rte_eal_mp_request() to guarantee that only one
> >>>>>> request is in flight for one pair of processes.
> >>>>> You don't need to do things in such a strange and restrictive way.
> >>>>> Instead you can do something like that:
> >>>>> 1) Introduce a new struct, a list for it, and a mutex:
> >>>>> struct sync_request {
> >>>>>     int reply_received;
> >>>>>     char dst[PATH_MAX];
> >>>>>     char reply[...];
> >>>>>     LIST_ENTRY(sync_request) next;
> >>>>> };
> >>>>>
> >>>>> static struct {
> >>>>>     LIST_HEAD(, sync_request) list;
> >>>>>     pthread_mutex_t lock;
> >>>>>     pthread_cond_t cond;
> >>>>> } sync_requests;
> >>>>>
> >>>>> 2) then at request() call:
> >>>>>     Grab sync_requests.lock
> >>>>>     Check whether we already have a pending request for that destination;
> >>>>>     if yes - release the lock and return with an error.
> >>>>>     - allocate and init a new sync_request struct, set reply_received=0
> >>>>>     - do send_msg()
> >>>>>     - then in a cycle:
> >>>>>       pthread_cond_timedwait(&sync_requests.cond, &sync_requests.lock, &timespec);
> >>>>>     - at return from it, check if sync_request.reply_received == 1; if not,
> >>>>> check if the timeout expired and either return a failure or go to the start of the cycle.
> >>>>>
> >>>>> 3) at mp_handler(), if a REPLY is received - grab sync_requests.lock,
> >>>>>     search through sync_requests.list for dst[],
> >>>>>     if found, then set its reply_received=1, copy the received message into reply,
> >>>>>     and call pthread_cond_broadcast(&sync_requests.cond);
> >>>> The only benefit I can see is that now the sender can send requests to
> >>>> multiple receivers at the same time. And it makes things more
> >>>> complicated. Do we really need this?
> >>> The benefit is that while one thread is blocked waiting for a response,
> >>> your mp_handler can still receive and handle other messages.
> >> This can already be done in the original implementation. mp_handler
> >> listens for msgs and requests from the other peer(s), and replies to the
> >> requests, which is not affected.
> >>
> >>> Plus, as you said - other threads can keep sending messages.
> >> For this one, in the original implementation, other threads can still
> >> send msgs, but not requests. I suppose the request is not on a fast path;
> >> why do we care to make it fast?
> >>
> > +int
> > +rte_eal_mp_request(const char *action_name,
> > +	void *params,
> > +	int len_p,
> > +	int fds[],
> > +	int fds_in,
> > +	int fds_out)
> > +{
> > +	int i, j;
> > +	int sockfd;
> > +	int nprocs;
> > +	int ret = 0;
> > +	struct mp_msghdr *req;
> > +	struct timeval tv;
> > +	char buf[MAX_MSG_LENGTH];
> > +	struct mp_msghdr *hdr;
> > +
> > +	RTE_LOG(DEBUG, EAL, "request: %s\n", action_name);
> > +
> > +	if (fds_in > SCM_MAX_FD || fds_out > SCM_MAX_FD) {
> > +		RTE_LOG(ERR, EAL, "Cannot send more than %d FDs\n", SCM_MAX_FD);
> > +		rte_errno = -E2BIG;
> > +		return 0;
> > +	}
> > +
> > +	req = format_msg(action_name, params, len_p, fds_in, MP_REQ);
> > +	if (req == NULL)
> > +		return 0;
> > +
> > +	if ((sockfd = open_unix_fd(0)) < 0) {
> > +		free(req);
> > +		return 0;
> > +	}
> > +
> > +	tv.tv_sec = 5;  /* 5-second timeout */
> > +	tv.tv_usec = 0;
> > +	if (setsockopt(sockfd, SOL_SOCKET, SO_RCVTIMEO,
> > +		(const void *)&tv, sizeof(struct timeval)) < 0)
> > +		RTE_LOG(INFO, EAL, "Failed to set recv timeout\n");
> >
> > If you set it just for one call, why do you not restore it?
>
> Yes, the original code is buggy; I should have put it into the critical section.
>
> Do you mean we should just create it once and use it forever? If yes, we could
> put the open and the setting into mp_init().
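To make the scheme above concrete - a rough, untested sketch only, not code from
the patch. wait_for_reply() and MAX_REPLY_LEN are made-up names, and the
duplicate-destination check is omitted for brevity. Note that with this scheme
the timeout lives in pthread_cond_timedwait(), so no SO_RCVTIMEO fiddling on the
socket is needed at all:

#include <errno.h>
#include <limits.h>
#include <pthread.h>
#include <string.h>
#include <sys/queue.h>
#include <time.h>

#define MAX_REPLY_LEN 1024	/* assumed cap on a reply body */

struct sync_request {
	int reply_received;		/* set by mp_handler() when the reply lands */
	char dst[PATH_MAX];		/* destination socket path */
	char reply[MAX_REPLY_LEN];	/* mp_handler() copies the reply here */
	size_t reply_len;
	LIST_ENTRY(sync_request) next;
};

static struct {
	LIST_HEAD(, sync_request) list;
	pthread_mutex_t lock;
	pthread_cond_t cond;
} sync_requests = {
	.list = LIST_HEAD_INITIALIZER(sync_requests.list),
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.cond = PTHREAD_COND_INITIALIZER,
};

/* Register a pending request and wait for its reply; timeout is per call. */
static int
wait_for_reply(struct sync_request *sr, int timeout_sec)
{
	struct timespec deadline;
	int rc = 0;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_sec;

	pthread_mutex_lock(&sync_requests.lock);
	LIST_INSERT_HEAD(&sync_requests.list, sr, next);
	/* send_msg() for the request would go here */
	while (sr->reply_received == 0 && rc != ETIMEDOUT)
		rc = pthread_cond_timedwait(&sync_requests.cond,
				&sync_requests.lock, &deadline);
	LIST_REMOVE(sr, next);
	pthread_mutex_unlock(&sync_requests.lock);

	return sr->reply_received ? 0 : -ETIMEDOUT;
}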
>
> > Also I don't think it is a good idea to change it here -
> > if you make the timeout a parameter value - then it could be overwritten
> > by different threads.
>
> For simplicity, I'm not inclined to expose the timeout as a parameter to
> the caller. So if you agree, I'll put it into mp_init() along with the open.

My preference would be to have the timeout value on a per-call basis.
For one request the user may want to wait no more than 5 sec; for another one
the user would probably be ok to wait forever.

>
> >
> > +
> > +	/* Only allow one req at a time */
> > +	pthread_mutex_lock(&mp_mutex_request);
> > +
> > +	if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
> > +		nprocs = 0;
> > +		for (i = 0; i < MAX_SECONDARY_PROCS; ++i)
> > +			if (!mp_sec_sockets[i]) {
> > +				j = i;
> > +				nprocs++;
> > +			}
> > +
> > +		if (nprocs > 1) {
> > +			RTE_LOG(ERR, EAL,
> > +				"multi secondary processes not supported\n");
> > +			goto free_and_ret;
> > +		}
> > +
> > +		ret = send_msg(sockfd, mp_sec_sockets[j], req, fds);
> >
> > As I remember - sendmsg() is also a blocking call, so under some conditions
> > you can stall there forever.
>
> From Linux's unix_dgram_sendmsg(), we see:
>     timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);

Ok, but that would have effect only if (msg->msg_flags & MSG_DONTWAIT) != 0.
And for that, as I remember, you need your socket in non-blocking mode, no?

>
> I assume it will not block for a datagram unix socket on Linux. But I'm
> not sure how it behaves on FreeBSD.
>
> Anyway, better to add an explicit setsockopt() to make it non-blocking.

You can't do that - at the same moment another thread might call your sendmsg()
and it might expect it to be a blocking call.

>
> > As mp_mutex_request is still held - the next rte_eal_mp_request() will also
> > block forever here.
> >
> > +	} else
> > +		ret = send_msg(sockfd, eal_mp_unix_path(), req, fds);
> > +
> > +	if (ret == 0) {
> > +		RTE_LOG(ERR, EAL, "failed to send request: %s\n", action_name);
> > +		ret = -1;
> > +		goto free_and_ret;
> > +	}
> > +
> > +	ret = read_msg(sockfd, buf, MAX_MSG_LENGTH, fds, fds_out, NULL);
> >
> > If the message you receive is not a reply you are expecting -
> > it will simply be dropped - mp_handler() would never process it.
>
> We cannot detect with absolute certainty whether it's the right reply; we just
> check the action_name, which means we could still get a wrong reply
> if there are multiple outstanding requests for the same action_name.
>
> Is just comparing the action_name acceptable?

As I can see, the main issue here is that you can call recvmsg() from 2 different
points and they are not synchronised:
1. Your mp_handler() isn't aware of the reply you are waiting for and doesn't
have any handler associated with it.
So if mp_handler() receives a reply, it will just drop it.
2. Your reply-reading code in request() is not aware of any other messages and
associated actions - so again it can't handle them properly (and probably
shouldn't).
The simplest (and most common) way: always call recvmsg() from one place -
mp_handler() - and have a special action for the reply msg.
As I wrote before, that action will just find the appropriate buffer provided
by request() - copy the message into it and signal the thread waiting in
request() that it can proceed.
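Continuing the sketch from above, the reply path inside mp_handler() could look
like this (again rough and untested; handle_reply() and the message-layout
parameters are assumptions, not the actual EAL code):

/* Called from mp_handler() when a message carrying the REPLY action arrives.
 * src is the sender's socket path; msg/len is the reply body. */
static void
handle_reply(const char *src, const void *msg, size_t len)
{
	struct sync_request *sr;

	pthread_mutex_lock(&sync_requests.lock);
	LIST_FOREACH(sr, &sync_requests.list, next) {
		if (strcmp(sr->dst, src) != 0)
			continue;
		/* copy the reply into the waiter's buffer ... */
		sr->reply_len = len < sizeof(sr->reply) ? len : sizeof(sr->reply);
		memcpy(sr->reply, msg, sr->reply_len);
		sr->reply_received = 1;
		break;
	}
	/* ... and wake all waiters; each one re-checks its own flag */
	pthread_cond_broadcast(&sync_requests.cond);
	pthread_mutex_unlock(&sync_requests.lock);
}

With recvmsg() confined to mp_handler() like that, the race between the handler
thread and request() simply disappears.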
Konstantin

>
> >
> > +	if (ret > 0) {
> > +		hdr = (struct mp_msghdr *)buf;
> > +		if (hdr->len_params == len_p)
> > +			memcpy(params, hdr->params, len_p);
> > +		else {
> > +			RTE_LOG(ERR, EAL, "invalid reply\n");
> > +			ret = 0;
> > +		}
> > +	}
> > +
> > +free_and_ret:
> > +	free(req);
> > +	close(sockfd);
> > +	pthread_mutex_unlock(&mp_mutex_request);
> > +	return ret;
> > +}
> >
> > All of the above makes me think that the current implementation is erroneous
> > and needs to be reworked.
>
> Thank you for your review. I'll work on a new version.
>
> Thanks,
> Jianfeng
>
> > Konstantin
> >
> >