From: Andrew Cooper
Subject: Interrupt code cleanup [RFC]
Date: Wed, 31 Aug 2011 16:52:13 +0100
Message-ID: <4E5E58AD.9060500@citrix.com>
To: "xen-devel@lists.xensource.com", Keir Fraser
List-Id: xen-devel@lists.xenproject.org

Following several recent subtle but serious bugs in the core of the Xen interrupt code, I am starting an extended project to clean up the code, remove the false logic (there are several areas where the logic is wrong now that per-cpu IDTs have been introduced), and to try to document some of the code and general flow.

So far, I have already done the following:

1) Folded irq_status into irq_cfg. This saves 744 bytes per CPU of data which is frequently referenced.

2) Introduced irq_cfg.old_vector, for the same reason that cpu_mask and old_cpu_mask exist. This removes a brute-force search in __clear_irq_vector() and allows for 3).

3) Removed irq_vector and replaced its functionality with irq_cfg.vector (and old_vector where relevant), because the logic surrounding irq_vector is false with per-cpu IDTs.

I have done some preliminary analysis of the code and data (before the changes listed above), assuming 64bit, NR_CPUS = 256 and a non-debug build:

* irq_desc is 140 bytes of data (cache aligned to 64 bytes), so 3 cache lines.

* irq_cfg is 83 bytes of data (not cache aligned), so 2 (possibly 3) cache lines.

* irq_desc.affinity and pending_mask seem functionally related to irq_cfg, but live in a separate structure.

* irq_desc.status and irq_cfg.move_in_progress hold overlapping information which could possibly be simplified.

* irq_desc.irq duplicates data which is almost always a parameter to irq-related functions, and which is trivial to calculate using pointer arithmetic if not otherwise available. (irq_desc[irq].irq === irq across the codebase.)

* irq_desc.chip_data is a pointer to an irq_cfg which is always available using the irq number and the irq_cfg array. (irq_desc[irq].chip_data === irq_cfg[irq] across the codebase.)

While the above is not a specific indication of things which need to be changed, it is interesting to note that irq_desc without the irq and chip_data fields becomes 128 bytes. Also, if the cpumask_t's were moved to irq_cfg, then irq_desc would be 64 bytes and would not scale with NR_CPUS at all.

Having said that, my real question is: is there a good reason why irq_desc and irq_cfg are separate structures? They are both referenced in all interrupt code paths.

At the moment, all cache alignment (including xmalloc and friends) is done by the compiler on the assumption that the cache line length is 64 bytes. Is it sensible to try to pull the L1 cache line size out of cpuid at boot time and work with that? Are there many systems in common use which do not have 64 byte cache lines?
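On that last point, the CLFLUSH line size is reported by CPUID leaf 1 (EBX bits 15:8, in 8-byte units, valid when the CLFSH feature bit is set), so it could be read once at boot and used in place of the hard-coded 64. A rough standalone sketch, not actual Xen code, with purely illustrative names:

/*
 * Rough standalone sketch (not Xen code): read the CLFLUSH/cache line
 * size from CPUID leaf 1 at boot.  Names are illustrative only.
 */
#include <stdint.h>

static inline void cpuid(uint32_t leaf,
                         uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    asm volatile ( "cpuid"
                   : "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
                   : "a" (leaf), "c" (0) );
}

/* Cache line size in bytes, falling back to 64 if CPUID does not say. */
static unsigned int detect_cache_line_size(void)
{
    uint32_t eax, ebx, ecx, edx;

    cpuid(1, &eax, &ebx, &ecx, &edx);

    /* EDX bit 19 (CLFSH) indicates that EBX[15:8] holds the CLFLUSH
     * line size in 8-byte units. */
    if ( edx & (1u << 19) )
        return ((ebx >> 8) & 0xff) * 8;

    return 64;
}

Whether anything other than 64 ever comes back on hardware we care about is, of course, the open question above.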
One error which I am aware of, and will be sorting out, is to do with the allocation of vectors for GSIs. The EOI broadcast from the local APIC to the IO-APICs only takes the vector into account, not the target CPU, meaning that two different interrupts with the same vector but different target CPUs will cross-EOI each other.

Are there any areas people have noticed where the irq code is limited and could be improved? One which comes to mind is that, at the moment, an irq migration requires the same free vector on all target PCPUs, which becomes problematic if the affinity covers more than one CPU on systems with large numbers of virtual functions, each with large numbers of interrupts. (I have seen one case where this problem caused a guest failure because Xen could not allocate a vector for passthrough, even though it had room to do so. I fear the problem will get worse in the future.)

Sorry if this appears a bit jumbled - it is still a bit of a brainstorm at the moment, but I wanted people's views/opinions.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com