From mboxrd@z Thu Jan 1 00:00:00 1970
From: Ian Jackson
Subject: Re: [PATCH] libxl: Increase device model startup timeout to 1min.
Date: Fri, 3 Jul 2015 12:30:14 +0100
Message-ID: <21910.29254.453905.459416@mariner.uk.xensource.com>
References: <1435336867.32500.209.camel@citrix.com>
 <20150629142317.GB1891@perard.uk.xensource.com>
 <1435589517.32500.342.camel@citrix.com>
 <20150629160919.GC1891@perard.uk.xensource.com>
 <21906.41470.414971.681618@mariner.uk.xensource.com>
 <21906.50807.907706.819950@mariner.uk.xensource.com>
 <20150702111148.GG1891@perard.uk.xensource.com>
 <21909.12493.239202.740226@mariner.uk.xensource.com>
 <20150703112150.GI1891@perard.uk.xensource.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <20150703112150.GI1891@perard.uk.xensource.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Anthony PERARD
Cc: xen-devel@lists.xen.org, Wei Liu, Ian Campbell, Stefano Stabellini
List-Id: xen-devel@lists.xenproject.org

Anthony PERARD writes ("Re: [PATCH] libxl: Increase device model startup timeout to 1min."):
> On Thu, Jul 02, 2015 at 01:38:37PM +0100, Ian Jackson wrote:
> > I'm starting to think that this might be a real bug but that the bug
> > might be "Linux's I/O subsystem sometimes produces appalling latency
> > under load" (which is hardly news).
>
> I guess the straces support this, here are few quote from different strace:
...
> 04:11:50.602639 mmap(0x7f845bc29000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x31000) = 0x7f845bc29000 <0.000038>
> 04:11:51.257654 close(3) = 0 <0.000042>
...
> The first quote is a pattern I'm seeing very often on slow dm start, where
> it take a long time between the mmap and the next syscall. On the second
> quote, read() is to blame, it took 1s.
>
> I guess even the first quote imply there is going to be I/O after the mmap
> call, isn't it?

It's very likely, yes.  The code after mmap will probably start
reading the pages just mapped.

Thanks for this investigation.

I am now convinced that this is indeed the bug "Linux's I/O subsystem
sometimes produces appalling latency under load".  That bug has
existed for at least a decade and seems unlikely to be fixed any time
soon.  Certainly, fixing it is beyond our scope.

So papering over this with an increase in the timeout is probably
proper.

I'm tempted to suggest increasing the timeout only on Linux.

Ian.
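
To illustrate the point about the code after mmap reading the pages
just mapped: the sketch below times the mmap call separately from the
first access to each page.  It is not taken from libxl or qemu; the
helper names (time_map_and_touch, make_demo_file) are made up for this
illustration, and Python is used only for brevity.  The assumption it
demonstrates is that mmap itself only sets up the mapping, and that any
file I/O happens later, on first access to each page, which strace
would show as a gap between syscalls rather than as a slow syscall.

```python
import mmap
import os
import tempfile
import time

PAGE = mmap.PAGESIZE


def make_demo_file(npages=16):
    """Create a temporary file of npages pages and return its path (hypothetical helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\0" * (npages * PAGE))
    return path


def time_map_and_touch(path):
    """Return (map_seconds, touch_seconds, pages_touched) for a read-only mapping of path."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        t0 = time.monotonic()
        mm = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_READ)
        t1 = time.monotonic()  # mapping is set up; no page has been accessed yet
        try:
            pages = 0
            for off in range(0, size, PAGE):
                mm[off]  # first access to this page; may fault and wait on disk I/O
                pages += 1
            t2 = time.monotonic()
        finally:
            mm.close()
    return t1 - t0, t2 - t1, pages


if __name__ == "__main__":
    demo = make_demo_file()
    try:
        map_s, touch_s, n = time_map_and_touch(demo)
        print(f"mmap: {map_s:.6f}s, touching {n} pages: {touch_s:.6f}s")
    finally:
        os.remove(demo)
```

A freshly written demo file will normally still be in the page cache,
so both numbers will be tiny here.  The delay seen in the straces would
only be expected when the pages have to come from disk on a loaded
host.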