Date: Tue, 18 Apr 2017 16:27:56 -0500
From: Chris Adams
To: linux-kernel@vger.kernel.org
Subject: Latency in logical volume layer?
Message-ID: <20170418212756.GC1452@cmadams.net>

I am trying to figure out a storage latency issue I am seeing with oVirt
and iSCSI storage, and I am looking for a little help (or to be told
"you're doing it wrong" as usual).

I have an oVirt virtualization cluster running with 7 CentOS 7 servers,
a dedicated storage LAN (separate switches), and iSCSI multipath running
to a SAN.  Occasionally, at times when there's no apparent load spike or
anything else unusual, oVirt will report 5+ second latency accessing a
storage domain.  I can't see any network issue or problem at the SAN, so
I started looking at Linux.

oVirt reports this when it tries to read the storage domain metadata.
With iSCSI storage, oVirt accesses it via multipath and treats the whole
device as a PV for Linux LVM (no partitioning).  The metadata is a small
LV from which each node reads the first 4K every few seconds (using
O_DIRECT to avoid caching).

I wrote a perl script to replicate this access pattern (open with
O_DIRECT, read the first 4K, close) and report the times.  I do see
higher than expected latency sometimes - 50-200ms latency happens fairly
regularly.  I added doing the same open/read/close on the PV (the
multipath device), and I do not see the same latency there.  It is a
very consistent 0.25-0.55ms latency.

I put a host in maintenance mode and disabled multipath, and I saw
similar behavior (comparing reads from the raw SCSI device and the LV
device).  I am testing on a host with no VMs.  I do sometimes (not
always) see similar latency on multiple hosts (the others are running
VMs) at the same time.

That's where I'm lost - how does going up the stack from the multipath
device to the LV add so much latency (but not all the time)?  I
recognize that the CentOS 7 kernel is not mainline, but was hoping that
maybe somebody would say "that's a known thing", or "that's expected",
or "you're measuring wrong".

Any suggestions, places to look, etc.?  Thanks.

-- 
Chris Adams
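
P.S.  In case it helps to see exactly what I mean by the access pattern,
here is a minimal C sketch of the test (my actual script is perl; the
device path, block size, and file name below are just placeholders, not
what oVirt itself runs):

/* o_direct_read_latency.c - time a single open/read/close of the first
 * 4K of a block device with O_DIRECT, roughly what my perl test does.
 *
 * Build: gcc -O2 -o o_direct_read_latency o_direct_read_latency.c
 * Run:   ./o_direct_read_latency /dev/mapper/somevg-somelv
 */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define READ_SIZE 4096          /* first 4K, like the metadata read */
#define ALIGNMENT 4096          /* O_DIRECT wants an aligned buffer */

int main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <device or LV path>\n", argv[0]);
                return 1;
        }

        void *buf;
        if (posix_memalign(&buf, ALIGNMENT, READ_SIZE)) {
                perror("posix_memalign");
                return 1;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        /* open/read/close, same as the metadata check */
        int fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        ssize_t n = read(fd, buf, READ_SIZE);
        if (n < 0) {
                perror("read");
                return 1;
        }
        close(fd);

        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("read %zd bytes in %.3f ms\n", n, ms);

        free(buf);
        return 0;
}

Running that in a loop against the LV path versus the underlying
multipath device is how I am comparing the two layers.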