* RCU issue with SELinux (Re: SELINUX performance issues)
@ 2004-08-16  9:33 Kaigai Kohei
  2004-08-16 15:19 ` James Morris
  0 siblings, 1 reply; 31+ messages in thread
From: Kaigai Kohei @ 2004-08-16  9:33 UTC
  To: SELinux-ML(Eng), Linux Kernel ML(Eng); +Cc: James Morris

Hello, everyone.

Sat, 7 Aug 2004 22:57:08 -0400 (EDT)
James Morris <jmorris@redhat.com> wrote:

> > The biggest problem is the global lock:
> > 
> > avc_has_perm_noaudit:
> >         spin_lock_irqsave(&avc_lock, flags);
> > 
> > Any chance we can get rid of it? Maybe with RCU?
> 
> Yes, known problem.  I plan on trying RCU soon, Rik was looking at a 
> seqlock approach.

I'm interested in the scalability of SELinux, so I tried both rwlock
and RCU approaches.

First, I simply replaced spin_lock_irqsave() with (read|write)_lock_irqsave(),
but a performance improvement was observed only in hackbench, not in OSDL-REAIM.
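
A minimal sketch of that substitution (avc_lock becomes an rwlock_t, and
the update paths take write_lock_irqsave() instead):

-	spin_lock_irqsave(&avc_lock, flags);
+	read_lock_irqsave(&avc_lock, flags);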

Next, I tried an RCU approach and came across the following problem.

Some AVC entries are referenced directly through the avc_entry_ref
structure embedded in various resource objects (task_struct, inode,
and so on). Thus, an invalidated AVC entry may still be referenced
after it has been detached from the AVC hash list.
Since the RCU model only expects forward scanning of a list, such
direct references to AVC entries are not appropriate.
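
To make the hazard concrete, here is a minimal sketch (illustrative
only; the stored reference typically lives in a security blob such as
task_struct->security):

	struct avc_entry_ref aeref;
	u32 allowed;

	avc_lookup(ssid, tsid, tclass, requested, &aeref);
	/* aeref.ae now points directly into an avc_node.  Meanwhile,
	 * another CPU may detach and reclaim that node:
	 *	list_del_rcu(&node->list);
	 *	call_rcu(&node->rhead, avc_node_free);
	 * RCU cannot help here, because aeref.ae is a long-lived
	 * pointer held outside any rcu_read_lock() section. */
	allowed = aeref.ae->avd.allowed;	/* may touch freed memory */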

In my opinion, these direct references to AVC entries should be removed
to fix this problem and improve the scalability of SELinux.
The purpose of the direct references is to speed up consecutive
access-control checks on the same object, so removing them may cost some
performance. But I think the cost is not significant: there are 512 hash
slots while the number of AVC entries is fixed at 410, so the average
chain holds less than one entry (410/512), and we can reach the target
AVC entry in one or two steps on average.

Is this approach of removing the direct references to AVC entries acceptable?

I'll try to consider this issue further.
--------
Kai Gai, Linux Promotion Center, NEC
E-mail: kaigai@ak.jp.nec.com



* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-16  9:33 RCU issue with SELinux (Re: SELINUX performance issues) Kaigai Kohei
@ 2004-08-16 15:19 ` James Morris
  2004-08-20 13:36   ` Kaigai Kohei
  0 siblings, 1 reply; 31+ messages in thread
From: James Morris @ 2004-08-16 15:19 UTC
  To: Kaigai Kohei; +Cc: SELinux-ML(Eng), Linux Kernel ML(Eng), Stephen Smalley

On Mon, 16 Aug 2004, Kaigai Kohei wrote:

> Is this approach of removing the direct references to AVC entries acceptable?
> 
> I'll try to consider this issue further.

Sure, if you can make it work without problems.



- James
-- 
James Morris
<jmorris@redhat.com>




* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-16 15:19 ` James Morris
@ 2004-08-20 13:36   ` Kaigai Kohei
  2004-08-20 14:53     ` James Morris
                       ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Kaigai Kohei @ 2004-08-20 13:36 UTC
  To: James Morris, Stephen Smalley; +Cc: SELinux-ML(Eng), Linux Kernel ML(Eng)

[-- Attachment #1: Type: text/plain, Size: 2883 bytes --]

Hello, everyone.

Tuesday, August 17, 2004 12:19 AM
James Morris wrote:
> > Is this approach of removing the direct references to AVC entries acceptable?
> > 
> > I'll try to consider this issue further.
> 
> Sure, if you can make it work without problems.

The attached patches against the 2.6.8.1 kernel improve the performance
and scalability of SELinux using an RCU approach.

The evaluation results are as follows:
<Environment>
CPU: Itanium2(1GHz) x 4/8/16/32
Memory: enough (no swap)
OS: 2.6.8.1 (SELinux disabled by 'selinux=0' boot option)
    2.6.8.1 (SELinux enabled)
    2.6.8.1 + rwlock patch by KaiGai
    2.6.8.1 + RCU patch by KaiGai

The test program iterates write() to files on tmpfs 500,000 times in parallel.
It is attached as 'rcutest.c'.

We observed a performance improvement and perfect scalability.

[rcutest Loop=500000 Parallel=<Num of CPUs>]
  - The upper row is the real time [seconds].
  - In the lower row, the left value is user time, the right is system time.
                 --- 4CPU ---  --- 8CPU ---  --- 16CPU ---  --- 32CPU ---
2.6.8.1(disable)    8.0158[s]     8.0183[s]      8.0146[s]      8.0037[s]
                 (2.08/ 6.07)   (1.86/6.31)   (0.78/ 7.33)    (2.03/5.94)
2.6.8.1(enable)    78.0957[s]   319.0451[s]   1322.0313[s]      too long 
                 (2.47/76.48) (2.47/316.96)  (2.43/1319.8)     (---/---) 
2.6.8.1.rwlock     20.0100[s]    49.0390[s]     90.0200[s]    177.0533[s]
                 (2.57/17.52)  (2.45/46.93)   (2.37/87.78)   (2.41/175.1)
2.6.8.1.rcu        11.0464[s]    11.0205[s]     11.0372[s]     11.0496[s]
                  (2.64/8.82)   (2.21/8.98)   (2.67/ 8.68)    (2.51/8.95)
-------------------------------------------------------------------------


These patches achieve lock-free read access to the AVC cache.

Significant changes are as follows:
- The direct references through avc_entry_ref were removed for RCU.
- Some structures (avc_entry/avc_node/avc_cache) were changed for RCU.
- avc_node objects are now allocated dynamically rather than statically.
  avc_reclaim_node() reclaims some avc_node objects when their number
  exceeds AVC_CACHE_THRESHOLD.
- To update the av_decision structure, I came up with a new RCU primitive,
  list_replace_rcu(), in include/linux/list.h .
  It is needed to switch the linked pointers atomically, replacing an old
  object with a new one. list_replace_rcu() is used in avc_update_node():
  the original object is duplicated, the copy is updated, and the copy
  then replaces the original (see the sketch below). We now pay more to
  update av_decision as the price of lock-free read access, but in my
  opinion such updates are rare.
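
For reference, the update path boils down to this copy-and-replace
pattern (abridged from avc_update_node() in the attached patch; error
handling omitted):

	new = kmalloc(sizeof(struct avc_node), GFP_ATOMIC);
	hvalue = avc_hash(ssid, tsid, tclass);

	spin_lock(&avc_cache.slots_lock[hvalue]);
	/* ... find 'org', the node to be updated, in slot 'hvalue' ... */
	memcpy(new, org, sizeof(struct avc_node));	/* duplicate       */
	new->ae.avd.allowed |= perms;			/* update the copy  */
	list_replace_rcu(&org->list, &new->list);	/* atomic switch    */
	call_rcu(&org->rhead, avc_node_free);		/* free after grace */
	spin_unlock(&avc_cache.slots_lock[hvalue]);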

NOTE: The fifth arguments of avc_has_perm() and avc_has_perm_noaudit()
      are no longer necessary, but the prototype declarations and all the
      callers are left unchanged, since I want to keep the patches minimal.

Any comments?
----------
KaiGai <kaigai@ak.jp.nec.com>

[-- Attachment #2: selinux.rcu-2.6.8.1.patch --]
[-- Type: application/octet-stream, Size: 21368 bytes --]

diff -rNU4 linux-2.6.8.1/security/selinux/avc.c linux-2.6.8.1.rcu/security/selinux/avc.c
--- linux-2.6.8.1/security/selinux/avc.c	2004-08-14 19:55:48.000000000 +0900
+++ linux-2.6.8.1.rcu/security/selinux/avc.c	2004-08-20 20:38:06.000000000 +0900
@@ -35,28 +35,31 @@
 #include "av_perm_to_string.h"
 #include "objsec.h"
 
 #define AVC_CACHE_SLOTS		512
-#define AVC_CACHE_MAXNODES	410
+#define AVC_CACHE_THRESHOLD	410
+#define AVC_CACHE_RECLAIM	16
 
 struct avc_entry {
 	u32			ssid;
 	u32			tsid;
 	u16			tclass;
 	struct av_decision	avd;
-	int			used;	/* used recently */
+	atomic_t		used;	/* used recently */
 };
 
 struct avc_node {
+	struct list_head	list;
+	struct rcu_head		rhead;
 	struct avc_entry	ae;
-	struct avc_node		*next;
 };
 
 struct avc_cache {
-	struct avc_node	*slots[AVC_CACHE_SLOTS];
-	u32		lru_hint;	/* LRU hint for reclaim scan */
-	u32		active_nodes;
-	u32		latest_notif;	/* latest revocation notification */
+	struct list_head	slots[AVC_CACHE_SLOTS];
+	spinlock_t		slots_lock[AVC_CACHE_SLOTS];	/* Lock for RCU update */
+	atomic_t		lru_hint;	/* LRU hint for reclaim scan */
+	atomic_t		active_nodes;
+	u32			latest_notif;	/* latest revocation notification */
 };
 
 struct avc_callback_node {
 	int (*callback) (u32 event, u32 ssid, u32 tsid,
@@ -69,10 +72,8 @@
 	u32 perms;
 	struct avc_callback_node *next;
 };
 
-static spinlock_t avc_lock = SPIN_LOCK_UNLOCKED;
-static struct avc_node *avc_node_freelist;
 static struct avc_cache avc_cache;
 static unsigned avc_cache_stats[AVC_NSTATS];
 static struct avc_callback_node *avc_callbacks;
 
@@ -187,52 +188,44 @@
  * Initialize the access vector cache.
  */
 void __init avc_init(void)
 {
-	struct avc_node	*new;
 	int i;
 
-	for (i = 0; i < AVC_CACHE_MAXNODES; i++) {
-		new = kmalloc(sizeof(*new), GFP_ATOMIC);
-		if (!new) {
-			printk(KERN_WARNING "avc:  only able to allocate "
-			       "%d entries\n", i);
-			break;
-		}
-		memset(new, 0, sizeof(*new));
-		new->next = avc_node_freelist;
-		avc_node_freelist = new;
-	}
-
+	for (i =0; i < AVC_CACHE_SLOTS; i++) {
+		INIT_LIST_HEAD(&avc_cache.slots[i]);
+		avc_cache.slots_lock[i] = SPIN_LOCK_UNLOCKED;
+	}
+	atomic_set(&avc_cache.active_nodes, 0);
+	atomic_set(&avc_cache.lru_hint, 0);
+	
 	audit_log(current->audit_context, "AVC INITIALIZED\n");
 }
 
 #if 0
 static void avc_hash_eval(char *tag)
 {
 	int i, chain_len, max_chain_len, slots_used;
 	struct avc_node *node;
+	struct list_head *pos;
 	unsigned long flags;
 
-	spin_lock_irqsave(&avc_lock,flags);
+	rcu_read_lock();
 
 	slots_used = 0;
 	max_chain_len = 0;
 	for (i = 0; i < AVC_CACHE_SLOTS; i++) {
-		node = avc_cache.slots[i];
-		if (node) {
+		if (!list_empty(&avc_cache.slots[i])) {
 			slots_used++;
 			chain_len = 0;
-			while (node) {
+			list_for_each_rcu(pos, &avc_cache.slots[i])
 				chain_len++;
-				node = node->next;
-			}
 			if (chain_len > max_chain_len)
 				max_chain_len = chain_len;
 		}
 	}
 
-	spin_unlock_irqrestore(&avc_lock,flags);
+	rcu_read_unlock();
 
 	printk(KERN_INFO "\n");
 	printk(KERN_INFO "%s avc:  %d entries and %d/%d buckets used, longest "
 	       "chain length %d\n", tag, avc_cache.active_nodes, slots_used,
@@ -242,188 +235,194 @@
 static inline void avc_hash_eval(char *tag)
 { }
 #endif
 
-static inline struct avc_node *avc_reclaim_node(void)
-{
-	struct avc_node *prev, *cur;
-	int hvalue, try;
+static void avc_node_free(struct rcu_head *rhead) {
+	struct avc_node *node;
+	node = container_of(rhead, struct avc_node, rhead);
+	kfree(node);
+}
 
-	hvalue = avc_cache.lru_hint;
-	for (try = 0; try < 2; try++) {
-		do {
-			prev = NULL;
-			cur = avc_cache.slots[hvalue];
-			while (cur) {
-				if (!cur->ae.used)
-					goto found;
+static inline void avc_reclaim_node(void)
+{
+	struct avc_node *node;
+	struct list_head *pos;
+	int hvalue, try, ecx;
 
-				cur->ae.used = 0;
+	for (try = 0, ecx = 0; try < AVC_CACHE_SLOTS; try++ ) {
+		hvalue = atomic_inc_return(&avc_cache.lru_hint) % AVC_CACHE_SLOTS;
 
-				prev = cur;
-				cur = cur->next;
+		if (spin_trylock(&avc_cache.slots_lock[hvalue])) {
+			list_for_each_rcu(pos, &avc_cache.slots[hvalue]) {
+				node = list_entry(pos, struct avc_node, list);
+				if (!atomic_dec_and_test(&node->ae.used)) {
+					/* Recently Unused */
+					list_del_rcu(&node->list);
+					call_rcu(&node->rhead, avc_node_free);
+					atomic_dec(&avc_cache.active_nodes);
+					ecx++;
+					if (ecx >= AVC_CACHE_RECLAIM) {
+						spin_unlock(&avc_cache.slots_lock[hvalue]);
+						goto out;
+					}
+				}
 			}
-			hvalue = (hvalue + 1) & (AVC_CACHE_SLOTS - 1);
-		} while (hvalue != avc_cache.lru_hint);
+			spin_unlock(&avc_cache.slots_lock[hvalue]);
+		}
 	}
-
-	panic("avc_reclaim_node");
-
-found:
-	avc_cache.lru_hint = hvalue;
-
-	if (prev == NULL)
-		avc_cache.slots[hvalue] = cur->next;
-	else
-		prev->next = cur->next;
-
-	return cur;
+out:
+	return;
 }
 
-static inline struct avc_node *avc_claim_node(u32 ssid,
-                                              u32 tsid, u16 tclass)
+static inline struct avc_node *avc_get_node(void)
 {
 	struct avc_node *new;
-	int hvalue;
+	int actives;
 
-	hvalue = avc_hash(ssid, tsid, tclass);
-	if (avc_node_freelist) {
-		new = avc_node_freelist;
-		avc_node_freelist = avc_node_freelist->next;
-		avc_cache.active_nodes++;
-	} else {
-		new = avc_reclaim_node();
-		if (!new)
-			goto out;
+	new = kmalloc(sizeof(struct avc_node), GFP_ATOMIC);
+	if (new) {
+		actives = atomic_inc_return(&avc_cache.active_nodes);
+		if (actives > AVC_CACHE_THRESHOLD)
+			avc_reclaim_node();
 	}
 
-	new->ae.used = 1;
-	new->ae.ssid = ssid;
-	new->ae.tsid = tsid;
-	new->ae.tclass = tclass;
-	new->next = avc_cache.slots[hvalue];
-	avc_cache.slots[hvalue] = new;
-
-out:
 	return new;
 }
 
 static inline struct avc_node *avc_search_node(u32 ssid, u32 tsid,
                                                u16 tclass, int *probes)
 {
-	struct avc_node *cur;
+	struct list_head *pos;
+	struct avc_node *node, *ret = NULL;
 	int hvalue;
 	int tprobes = 1;
 
 	hvalue = avc_hash(ssid, tsid, tclass);
-	cur = avc_cache.slots[hvalue];
-	while (cur != NULL &&
-	       (ssid != cur->ae.ssid ||
-		tclass != cur->ae.tclass ||
-		tsid != cur->ae.tsid)) {
+	list_for_each_rcu(pos, &avc_cache.slots[hvalue]) {
+		node = list_entry(pos, struct avc_node, list);
+		if (ssid == node->ae.ssid &&
+		    tclass == node->ae.tclass &&
+		    tsid == node->ae.tsid) {
+			ret = node;
+			break;
+		}
 		tprobes++;
-		cur = cur->next;
 	}
 
-	if (cur == NULL) {
+	if (ret == NULL) {
 		/* cache miss */
 		goto out;
 	}
 
 	/* cache hit */
 	if (probes)
 		*probes = tprobes;
-
-	cur->ae.used = 1;
-
+	if (atomic_read(&ret->ae.used) != 1)
+		atomic_set(&ret->ae.used, 1);
 out:
-	return cur;
+	return ret;
 }
 
 /**
  * avc_lookup - Look up an AVC entry.
  * @ssid: source security identifier
  * @tsid: target security identifier
  * @tclass: target security class
  * @requested: requested permissions, interpreted based on @tclass
- * @aeref:  AVC entry reference
  *
  * Look up an AVC entry that is valid for the
  * @requested permissions between the SID pair
  * (@ssid, @tsid), interpreting the permissions
  * based on @tclass.  If a valid AVC entry exists,
- * then this function updates @aeref to refer to the
- * entry and returns %0. Otherwise, this function
- * returns -%ENOENT.
+ * then this function return the avc_node.
+ * Otherwise, this function returns NULL.
  */
-int avc_lookup(u32 ssid, u32 tsid, u16 tclass,
-               u32 requested, struct avc_entry_ref *aeref)
+struct avc_node *avc_lookup(u32 ssid, u32 tsid, u16 tclass, u32 requested)
 {
 	struct avc_node *node;
-	int probes, rc = 0;
+	int probes;
 
 	avc_cache_stats_incr(AVC_CAV_LOOKUPS);
 	node = avc_search_node(ssid, tsid, tclass,&probes);
 
 	if (node && ((node->ae.avd.decided & requested) == requested)) {
 		avc_cache_stats_incr(AVC_CAV_HITS);
 		avc_cache_stats_add(AVC_CAV_PROBES,probes);
-		aeref->ae = &node->ae;
 		goto out;
 	}
 
 	avc_cache_stats_incr(AVC_CAV_MISSES);
-	rc = -ENOENT;
 out:
-	return rc;
+	return node;
+}
+
+static int avc_latest_notif_update(int seqno, int is_insert)
+{
+	int ret = 0;
+	static spinlock_t notif_lock = SPIN_LOCK_UNLOCKED;
+
+	spin_lock(&notif_lock);
+	if (seqno < avc_cache.latest_notif) {
+		if (is_insert) {
+			printk(KERN_WARNING "avc:  seqno %d < latest_notif %d\n",
+			       seqno, avc_cache.latest_notif);
+			ret = -EAGAIN;
+		} else {
+			avc_cache.latest_notif = seqno;
+		}
+	}
+	spin_unlock(&notif_lock);
+	return ret;
 }
 
 /**
  * avc_insert - Insert an AVC entry.
  * @ssid: source security identifier
  * @tsid: target security identifier
  * @tclass: target security class
  * @ae: AVC entry
- * @aeref:  AVC entry reference
  *
  * Insert an AVC entry for the SID pair
  * (@ssid, @tsid) and class @tclass.
  * The access vectors and the sequence number are
  * normally provided by the security server in
  * response to a security_compute_av() call.  If the
  * sequence number @ae->avd.seqno is not less than the latest
  * revocation notification, then the function copies
- * the access vectors into a cache entry, updates
- * @aeref to refer to the entry, and returns %0.
- * Otherwise, this function returns -%EAGAIN.
+ * the access vectors into a cache entry, returns
+ * avc_node inserted. Otherwise, this function returns NULL.
  */
-int avc_insert(u32 ssid, u32 tsid, u16 tclass,
-               struct avc_entry *ae, struct avc_entry_ref *aeref)
+struct avc_node *avc_insert(u32 ssid, u32 tsid, u16 tclass, struct avc_entry *ae)
 {
-	struct avc_node *node;
-	int rc = 0;
+	struct avc_node *node = NULL;
+	int hvalue;
 
-	if (ae->avd.seqno < avc_cache.latest_notif) {
-		printk(KERN_WARNING "avc:  seqno %d < latest_notif %d\n",
-		       ae->avd.seqno, avc_cache.latest_notif);
-		rc = -EAGAIN;
+	if (avc_latest_notif_update(ae->avd.seqno, 1))
 		goto out;
-	}
 
-	node = avc_claim_node(ssid, tsid, tclass);
-	if (!node) {
-		rc = -ENOMEM;
-		goto out;
-	}
+	node = avc_get_node();
+
+	if (node) {
+		hvalue = avc_hash(ssid, tsid, tclass);
+
+		INIT_LIST_HEAD(&node->list);
+		INIT_RCU_HEAD(&node->rhead);
 
-	node->ae.avd.allowed = ae->avd.allowed;
-	node->ae.avd.decided = ae->avd.decided;
-	node->ae.avd.auditallow = ae->avd.auditallow;
-	node->ae.avd.auditdeny = ae->avd.auditdeny;
-	node->ae.avd.seqno = ae->avd.seqno;
-	aeref->ae = &node->ae;
+		node->ae.ssid = ssid;
+		node->ae.tsid = tsid;
+		node->ae.tclass = tclass;
+		atomic_set(&node->ae.used, 1);
+
+		node->ae.avd.allowed = ae->avd.allowed;
+		node->ae.avd.decided = ae->avd.decided;
+		node->ae.avd.auditallow = ae->avd.auditallow;
+		node->ae.avd.auditdeny = ae->avd.auditdeny;
+		node->ae.avd.seqno = ae->avd.seqno;
+
+		list_add_rcu(&node->list, &avc_cache.slots[hvalue]);
+	}
 out:
-	return rc;
+	return node;
 }
 
 static inline void avc_print_ipv6_addr(struct audit_buffer *ab,
 				       struct in6_addr *addr, u16 port,
@@ -685,63 +684,99 @@
 {
 	return (x == y || x == SECSID_WILD || y == SECSID_WILD);
 }
 
-static inline void avc_update_node(u32 event, struct avc_node *node, u32 perms)
+static void avc_update_node(u32 event, u32 perms, u32 ssid, u32 tsid, u16 tclass)
 {
-	switch (event) {
-	case AVC_CALLBACK_GRANT:
-		node->ae.avd.allowed |= perms;
-		break;
-	case AVC_CALLBACK_TRY_REVOKE:
-	case AVC_CALLBACK_REVOKE:
-		node->ae.avd.allowed &= ~perms;
-		break;
-	case AVC_CALLBACK_AUDITALLOW_ENABLE:
-		node->ae.avd.auditallow |= perms;
-		break;
-	case AVC_CALLBACK_AUDITALLOW_DISABLE:
-		node->ae.avd.auditallow &= ~perms;
-		break;
-	case AVC_CALLBACK_AUDITDENY_ENABLE:
-		node->ae.avd.auditdeny |= perms;
-		break;
-	case AVC_CALLBACK_AUDITDENY_DISABLE:
-		node->ae.avd.auditdeny &= ~perms;
-		break;
+	int hvalue;
+	struct list_head *pos;
+	struct avc_node *new, *node, *org = NULL;
+
+	new = kmalloc(sizeof(struct avc_node), GFP_ATOMIC);
+	hvalue = avc_hash(ssid, tsid, tclass);
+
+	/* Lock the target slot */
+	spin_lock(&avc_cache.slots_lock[hvalue]);
+	list_for_each(pos, &avc_cache.slots[hvalue]) {
+		node = list_entry(pos, struct avc_node, list);
+		if (ssid==node->ae.ssid &&
+		    tsid==node->ae.tsid &&
+		    tclass==node->ae.tclass ) {
+			org = node;
+			break;
+		}
+	}
+
+	if (org) {
+		if (!new) {
+			list_del_rcu(&org->list);
+			call_rcu(&org->rhead, avc_node_free);
+			goto out;
+		}
+
+		/* Duplicate and Update */
+		memcpy(new, org, sizeof(struct avc_node));
+
+		switch (event) {
+		case AVC_CALLBACK_GRANT:
+			new->ae.avd.allowed |= perms;
+			break;
+		case AVC_CALLBACK_TRY_REVOKE:
+		case AVC_CALLBACK_REVOKE:
+			new->ae.avd.allowed &= ~perms;
+			break;
+		case AVC_CALLBACK_AUDITALLOW_ENABLE:
+			new->ae.avd.auditallow |= perms;
+			break;
+		case AVC_CALLBACK_AUDITALLOW_DISABLE:
+			new->ae.avd.auditallow &= ~perms;
+			break;
+		case AVC_CALLBACK_AUDITDENY_ENABLE:
+			new->ae.avd.auditdeny |= perms;
+			break;
+		case AVC_CALLBACK_AUDITDENY_DISABLE:
+			new->ae.avd.auditdeny &= ~perms;
+			break;
+		}
+
+		list_replace_rcu(&org->list,&new->list);
+		call_rcu(&org->rhead, avc_node_free);
 	}
+out:
+	spin_unlock(&avc_cache.slots_lock[hvalue]);
 }
 
 static int avc_update_cache(u32 event, u32 ssid, u32 tsid,
                             u16 tclass, u32 perms)
 {
+	struct list_head *pos;
 	struct avc_node *node;
 	int i;
-	unsigned long flags;
 
-	spin_lock_irqsave(&avc_lock,flags);
+	rcu_read_lock();
 
 	if (ssid == SECSID_WILD || tsid == SECSID_WILD) {
 		/* apply to all matching nodes */
 		for (i = 0; i < AVC_CACHE_SLOTS; i++) {
-			for (node = avc_cache.slots[i]; node;
-			     node = node->next) {
+			list_for_each_rcu(pos, &avc_cache.slots[i] ) {
+				node = list_entry(pos, struct avc_node, list);
 				if (avc_sidcmp(ssid, node->ae.ssid) &&
 				    avc_sidcmp(tsid, node->ae.tsid) &&
-				    tclass == node->ae.tclass) {
-					avc_update_node(event,node,perms);
+				    tclass == node->ae.tclass ) {
+					avc_update_node(event, perms, node->ae.ssid
+					                ,node->ae.tsid, node->ae.tclass);
 				}
 			}
 		}
 	} else {
 		/* apply to one node */
 		node = avc_search_node(ssid, tsid, tclass, NULL);
 		if (node) {
-			avc_update_node(event,node,perms);
+			avc_update_node(event, perms, ssid, tsid, tclass);
 		}
 	}
 
-	spin_unlock_irqrestore(&avc_lock,flags);
+	rcu_read_unlock();
 
 	return 0;
 }
 
@@ -751,9 +786,8 @@
 {
 	struct avc_callback_node *c;
 	u32 tretained = 0, cretained = 0;
 	int rc = 0;
-	unsigned long flags;
 
 	/*
 	 * try_revoke only removes permissions from the cache
 	 * state if they are not retained by the object manager.
@@ -786,12 +820,9 @@
 		avc_update_cache(event,ssid,tsid,tclass,perms);
 		*out_retained = tretained;
 	}
 
-	spin_lock_irqsave(&avc_lock,flags);
-	if (seqno > avc_cache.latest_notif)
-		avc_cache.latest_notif = seqno;
-	spin_unlock_irqrestore(&avc_lock,flags);
+	avc_latest_notif_update(seqno, 0);
 
 out:
 	return rc;
 }
@@ -856,34 +887,24 @@
 int avc_ss_reset(u32 seqno)
 {
 	struct avc_callback_node *c;
 	int i, rc = 0;
-	struct avc_node *node, *tmp;
-	unsigned long flags;
+	struct list_head *pos;
+	struct avc_node *node;
 
 	avc_hash_eval("reset");
 
-	spin_lock_irqsave(&avc_lock,flags);
+	rcu_read_lock();
 
 	for (i = 0; i < AVC_CACHE_SLOTS; i++) {
-		node = avc_cache.slots[i];
-		while (node) {
-			tmp = node;
-			node = node->next;
-			tmp->ae.ssid = tmp->ae.tsid = SECSID_NULL;
-			tmp->ae.tclass = SECCLASS_NULL;
-			tmp->ae.avd.allowed = tmp->ae.avd.decided = 0;
-			tmp->ae.avd.auditallow = tmp->ae.avd.auditdeny = 0;
-			tmp->ae.used = 0;
-			tmp->next = avc_node_freelist;
-			avc_node_freelist = tmp;
-			avc_cache.active_nodes--;
+		list_for_each_rcu(pos, &avc_cache.slots[i]) {
+			node = list_entry(pos, struct avc_node, list);
+			list_del_rcu(&node->list);
+			call_rcu(&node->rhead, avc_node_free);
 		}
-		avc_cache.slots[i] = NULL;
 	}
-	avc_cache.lru_hint = 0;
 
-	spin_unlock_irqrestore(&avc_lock,flags);
+	rcu_read_unlock();
 
 	for (i = 0; i < AVC_NSTATS; i++)
 		avc_cache_stats[i] = 0;
 
@@ -895,12 +916,9 @@
 				goto out;
 		}
 	}
 
-	spin_lock_irqsave(&avc_lock,flags);
-	if (seqno > avc_cache.latest_notif)
-		avc_cache.latest_notif = seqno;
-	spin_unlock_irqrestore(&avc_lock,flags);
+	avc_latest_notif_update(seqno, 0);
 out:
 	return rc;
 }
 
@@ -949,9 +967,9 @@
  * @ssid: source security identifier
  * @tsid: target security identifier
  * @tclass: target security class
  * @requested: requested permissions, interpreted based on @tclass
- * @aeref:  AVC entry reference
+ * @aeref:  AVC entry reference(not in use)
  * @avd: access vector decisions
  *
  * Check the AVC to determine whether the @requested permissions are granted
  * for the SID pair (@ssid, @tsid), interpreting the permissions
@@ -968,72 +986,45 @@
 int avc_has_perm_noaudit(u32 ssid, u32 tsid,
                          u16 tclass, u32 requested,
                          struct avc_entry_ref *aeref, struct av_decision *avd)
 {
-	struct avc_entry *ae;
-	int rc = 0;
-	unsigned long flags;
+	struct avc_node *node;
 	struct avc_entry entry;
+	int rc = 0;
 	u32 denied;
-	struct avc_entry_ref ref;
 
-	if (!aeref) {
-		avc_entry_ref_init(&ref);
-		aeref = &ref;
-	}
-
-	spin_lock_irqsave(&avc_lock, flags);
+	rcu_read_lock();
 	avc_cache_stats_incr(AVC_ENTRY_LOOKUPS);
-	ae = aeref->ae;
-	if (ae) {
-		if (ae->ssid == ssid &&
-		    ae->tsid == tsid &&
-		    ae->tclass == tclass &&
-		    ((ae->avd.decided & requested) == requested)) {
-			avc_cache_stats_incr(AVC_ENTRY_HITS);
-			ae->used = 1;
-		} else {
-			avc_cache_stats_incr(AVC_ENTRY_DISCARDS);
-			ae = NULL;
-		}
-	}
 
-	if (!ae) {
-		avc_cache_stats_incr(AVC_ENTRY_MISSES);
-		rc = avc_lookup(ssid, tsid, tclass, requested, aeref);
-		if (rc) {
-			spin_unlock_irqrestore(&avc_lock,flags);
-			rc = security_compute_av(ssid,tsid,tclass,requested,&entry.avd);
-			if (rc)
-				goto out;
-			spin_lock_irqsave(&avc_lock, flags);
-			rc = avc_insert(ssid,tsid,tclass,&entry,aeref);
-			if (rc) {
-				spin_unlock_irqrestore(&avc_lock,flags);
-				goto out;
-			}
+	node = avc_lookup(ssid, tsid, tclass, requested);
+	if (!node) {
+		rcu_read_unlock();
+		rc = security_compute_av(ssid,tsid,tclass,requested,&entry.avd);
+		if (rc)
+			goto out;
+		rcu_read_lock();
+		node = avc_insert(ssid,tsid,tclass,&entry);
+		if (!node) {
+			rcu_read_unlock();
+			goto out;
 		}
-		ae = aeref->ae;
 	}
-
 	if (avd)
-		memcpy(avd, &ae->avd, sizeof(*avd));
+		memcpy(avd, &node->ae.avd, sizeof(*avd));
 
-	denied = requested & ~(ae->avd.allowed);
+	denied = requested & ~(node->ae.avd.allowed);
 
 	if (!requested || denied) {
 		if (selinux_enforcing) {
-			spin_unlock_irqrestore(&avc_lock,flags);
 			rc = -EACCES;
-			goto out;
 		} else {
-			ae->avd.allowed |= requested;
-			spin_unlock_irqrestore(&avc_lock,flags);
-			goto out;
+			if (node->ae.avd.allowed != (node->ae.avd.allowed|requested))
+				avc_update_node(AVC_CALLBACK_GRANT
+				                ,requested,ssid,tsid,tclass);
 		}
 	}
 
-	spin_unlock_irqrestore(&avc_lock,flags);
+	rcu_read_unlock();
 out:
 	return rc;
 }
 
@@ -1042,9 +1033,9 @@
  * @ssid: source security identifier
  * @tsid: target security identifier
  * @tclass: target security class
  * @requested: requested permissions, interpreted based on @tclass
- * @aeref:  AVC entry reference
+ * @aeref:  AVC entry reference(not in use)
  * @auditdata: auxiliary audit data
  *
  * Check the AVC to determine whether the @requested permissions are granted
  * for the SID pair (@ssid, @tsid), interpreting the permissions
@@ -1061,8 +1052,8 @@
 {
 	struct av_decision avd;
 	int rc;
 
-	rc = avc_has_perm_noaudit(ssid, tsid, tclass, requested, aeref, &avd);
+	rc = avc_has_perm_noaudit(ssid, tsid, tclass, requested, NULL, &avd);
 	avc_audit(ssid, tsid, tclass, requested, &avd, rc, auditdata);
 	return rc;
 }
diff -rNU4 linux-2.6.8.1/security/selinux/include/avc.h linux-2.6.8.1.rcu/security/selinux/include/avc.h
--- linux-2.6.8.1/security/selinux/include/avc.h	2004-08-14 19:54:51.000000000 +0900
+++ linux-2.6.8.1.rcu/security/selinux/include/avc.h	2004-08-20 20:40:50.000000000 +0900
@@ -118,13 +118,11 @@
  */
 
 void __init avc_init(void);
 
-int avc_lookup(u32 ssid, u32 tsid, u16 tclass,
-               u32 requested, struct avc_entry_ref *aeref);
+struct avc_node *avc_lookup(u32 ssid, u32 tsid, u16 tclass, u32 requested);
 
-int avc_insert(u32 ssid, u32 tsid, u16 tclass,
-               struct avc_entry *ae, struct avc_entry_ref *out_aeref);
+struct avc_node *avc_insert(u32 ssid, u32 tsid, u16 tclass, struct avc_entry *ae);
 
 void avc_audit(u32 ssid, u32 tsid,
                u16 tclass, u32 requested,
                struct av_decision *avd, int result, struct avc_audit_data *auditdata);

[-- Attachment #3: list_replace_rcu-2.6.8.1.patch --]
[-- Type: application/octet-stream, Size: 845 bytes --]

--- linux-2.6.8.1/include/linux/list.h	2004-08-14 19:55:33.000000000 +0900
+++ linux-2.6.8.1.rcu/include/linux/list.h	2004-08-20 18:04:10.000000000 +0900
@@ -194,8 +194,23 @@
 	__list_del(entry->prev, entry->next);
 	entry->prev = LIST_POISON2;
 }
 
+/**
+ * list_replace_rcu - replace an old entry with a new one in a list
+ * @old : the element to be replaced in the list.
+ * @new : the new element to insert into the list.
+ *
+ * The old entry will be replaced by the new entry atomically.
+ */
+static inline void list_replace_rcu(struct list_head *old, struct list_head *new){
+	new->next = old->next;
+	new->prev = old->prev;
+	smp_wmb();
+	new->next->prev = new;
+	new->prev->next = new;
+}
+
 /**
  * list_del_init - deletes entry from list and reinitialize it.
  * @entry: the element to delete from the list.
  */

[-- Attachment #4: rcutest.c --]
[-- Type: application/octet-stream, Size: 1478 bytes --]

/* -----------------------------------------
usage:
  % gcc -O3 rcutest.c -o rcutest
  % N=`grep -c processor /proc/cpuinfo`
  % ./rcutest ${N}
  Copyright: KaiGai <kaigai@ak.jp.nec.com>
----------------------------------------- */

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#define LOOP     (5000000)

void child_shmfile(){
  int i,fd;
  char buf[100],fname[100];
  
  sprintf(fname,"/dev/shm/file.%d",getpid());
  
  fd = open(fname,O_RDWR|O_CREAT|O_TRUNC,0644);
  if( fd < 0 ){
    printf("NOTE: Process(%d) can not open %s",getpid(),fname);
    return;
  }
  
  for(i=0;i<LOOP;i++){
    write(fd,buf,1);
    lseek(fd,0,SEEK_SET);
  }
  
  unlink(fname);
}

int main(int argc, char *argv[]){
  int fd,i,j,cproc,nproc;
  int sec,usec;
  struct timeval stm,etm;
  
  if( argc!=2 )
    return( 1 );
  nproc = atoi(argv[1]);
  
  gettimeofday(&stm, NULL);
  for(i=0,cproc=0;i<nproc;i++){
    j = fork();
    if( j==0 ){
      child_shmfile();
      exit( 0 );
    }else if( j>0 ){
      cproc++;
    }
  }
  while( waitpid(-1,NULL,0) > 0 );	/* wait for all children to finish */
  gettimeofday(&etm, NULL);

  sec = etm.tv_sec - stm.tv_sec;
  usec = etm.tv_usec - stm.tv_usec;
  if( usec<0 ){
    sec--;
    usec = 1000000 + usec;
  }
  printf("[%d Proc x %d Loop] Time: %.4f \n",cproc,LOOP,
	 (double)sec + ((double)usec)/10000000.0 );
}



* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-20 13:36   ` Kaigai Kohei
@ 2004-08-20 14:53     ` James Morris
  2004-08-24  7:27       ` Kaigai Kohei
  2004-08-20 17:31     ` RCU issue with SELinux (Re: SELINUX performance issues) Luke Kenneth Casson Leighton
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 31+ messages in thread
From: James Morris @ 2004-08-20 14:53 UTC
  To: Kaigai Kohei; +Cc: Stephen Smalley, SELinux-ML(Eng), Linux Kernel ML(Eng)

On Fri, 20 Aug 2004, Kaigai Kohei wrote:

> The attached patches against the 2.6.8.1 kernel improve the performance
> and scalability of SELinux using an RCU approach.


>                  --- 4CPU ---  --- 8CPU ---  --- 16CPU ---  --- 32CPU ---
> 2.6.8.1(disable)    8.0158[s]     8.0183[s]      8.0146[s]      8.0037[s]
>                  (2.08/ 6.07)   (1.86/6.31)   (0.78/ 7.33)    (2.03/5.94)
> 2.6.8.1(enable)    78.0957[s]   319.0451[s]   1322.0313[s]      too long 
>                  (2.47/76.48) (2.47/316.96)  (2.43/1319.8)     (---/---) 
> 2.6.8.1.rwlock     20.0100[s]    49.0390[s]     90.0200[s]    177.0533[s]
>                  (2.57/17.52)  (2.45/46.93)   (2.37/87.78)   (2.41/175.1)
> 2.6.8.1.rcu        11.0464[s]    11.0205[s]     11.0372[s]     11.0496[s]
>                   (2.64/8.82)   (2.21/8.98)   (2.67/ 8.68)    (2.51/8.95)
> -------------------------------------------------------------------------

Do you have figures for 1 and 2 CPUs?

Also, can you run some more benchmarks, e.g. lmbench, unixbench, dbench,
etc.?


- James
-- 
James Morris
<jmorris@redhat.com>




* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-20 13:36   ` Kaigai Kohei
  2004-08-20 14:53     ` James Morris
@ 2004-08-20 17:31     ` Luke Kenneth Casson Leighton
  2004-08-20 18:15       ` James Morris
  2004-08-20 20:19     ` Paul E. McKenney
       [not found]     ` <1093014789.16585.186.camel@moss-spartans.epoch.ncsc.mil>
  3 siblings, 1 reply; 31+ messages in thread
From: Luke Kenneth Casson Leighton @ 2004-08-20 17:31 UTC
  To: Kaigai Kohei
  Cc: James Morris, Stephen Smalley, SELinux-ML(Eng), Linux Kernel ML(Eng)

On Fri, Aug 20, 2004 at 10:36:03PM +0900, Kaigai Kohei wrote:
> Hello, everyone.
> 
> Tuesday, August 17, 2004 12:19 AM
> James Morris wrote:
> > > Is this approach of removing the direct references to AVC entries acceptable?
> > > 
> > > I'll try to consider this issue further.
> > 
> > Sure, if you can make it work without problems.
> 
> The attached patches against the 2.6.8.1 kernel improve the performance
> and scalability of SELinux using an RCU approach.
> 
> The evaluation results are as follows:
> <Environment>
> CPU: Itanium2(1GHz) x 4/8/16/32
> Memory: enough (no swap)
> OS: 2.6.8.1 (SELinux disabled by 'selinux=0' boot option)
>     2.6.8.1 (SELinux enabled)
>     2.6.8.1 + rwlock patch by KaiGai
>     2.6.8.1 + RCU patch by KaiGai
> 
> The test program iterates write() to files on tmpfs 500,000 times in parallel.
                                                ^^^^^
 i presume that that is without the xattr patch which adds extended
 attributes to tmpfs?

 (see http://hands.com/~lkcl/selinux)

 l.



* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-20 17:31     ` RCU issue with SELinux (Re: SELINUX performance issues) Luke Kenneth Casson Leighton
@ 2004-08-20 18:15       ` James Morris
  0 siblings, 0 replies; 31+ messages in thread
From: James Morris @ 2004-08-20 18:15 UTC
  To: Luke Kenneth Casson Leighton
  Cc: Kaigai Kohei, Stephen Smalley, SELinux-ML(Eng), Linux Kernel ML(Eng)

On Fri, 20 Aug 2004, Luke Kenneth Casson Leighton wrote:

>  i presume that that is without the xattr patch which adds extended
>  attributes to tmpfs?

Yes.

-- 
James Morris
<jmorris@redhat.com>




* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-20 13:36   ` Kaigai Kohei
  2004-08-20 14:53     ` James Morris
  2004-08-20 17:31     ` RCU issue with SELinux (Re: SELINUX performance issues) Luke Kenneth Casson Leighton
@ 2004-08-20 20:19     ` Paul E. McKenney
  2004-08-20 20:35       ` James Morris
  2004-08-24  7:27       ` Kaigai Kohei
       [not found]     ` <1093014789.16585.186.camel@moss-spartans.epoch.ncsc.mil>
  3 siblings, 2 replies; 31+ messages in thread
From: Paul E. McKenney @ 2004-08-20 20:19 UTC
  To: Kaigai Kohei
  Cc: James Morris, Stephen Smalley, SELinux-ML(Eng), Linux Kernel ML(Eng)

On Fri, Aug 20, 2004 at 10:36:03PM +0900, Kaigai Kohei wrote:
> Hello, everyone.
> 
> Tuesday, August 17, 2004 12:19 AM
> James Morris wrote:
> > > Is this approach of removing the direct references to AVC entries acceptable?
> > > 
> > > I'll try to consider this issue further.
> > 
> > Sure, if you can make it work without problems.
> 
> The attached patches against the 2.6.8.1 kernel improve the performance
> and scalability of SELinux using an RCU approach.
> 
> The evaluation results are as follows:
> <Environment>
> CPU: Itanium2(1GHz) x 4/8/16/32
> Memory: enough (no swap)
> OS: 2.6.8.1 (SELinux disabled by 'selinux=0' boot option)
>     2.6.8.1 (SELinux enabled)
>     2.6.8.1 + rwlock patch by KaiGai
>     2.6.8.1 + RCU patch by KaiGai
> 
> The test program iterates write() to files on tmpfs 500,000 times in parallel.
> It is attached as 'rcutest.c'.
> 
> We observed a performance improvement and perfect scalability.
> 
> [rcutest Loop=500000 Parallel=<Num of CPUs>]
>   - The upper row is the real time [seconds].
>   - In the lower row, the left value is user time, the right is system time.
>                  --- 4CPU ---  --- 8CPU ---  --- 16CPU ---  --- 32CPU ---
> 2.6.8.1(disable)    8.0158[s]     8.0183[s]      8.0146[s]      8.0037[s]
>                  (2.08/ 6.07)   (1.86/6.31)   (0.78/ 7.33)    (2.03/5.94)
> 2.6.8.1(enable)    78.0957[s]   319.0451[s]   1322.0313[s]      too long 
>                  (2.47/76.48) (2.47/316.96)  (2.43/1319.8)     (---/---) 
> 2.6.8.1.rwlock     20.0100[s]    49.0390[s]     90.0200[s]    177.0533[s]
>                  (2.57/17.52)  (2.45/46.93)   (2.37/87.78)   (2.41/175.1)
> 2.6.8.1.rcu        11.0464[s]    11.0205[s]     11.0372[s]     11.0496[s]
>                   (2.64/8.82)   (2.21/8.98)   (2.67/ 8.68)    (2.51/8.95)
> -------------------------------------------------------------------------

Nice performance increase and great scalability!  I echo an earlier
email asking for 1-CPU and 2-CPU results.

> These patches achieve lock-free read access to the AVC cache.
> 
> Significant changes are as follows:
> - The direct references through avc_entry_ref were removed for RCU.
> - Some structures (avc_entry/avc_node/avc_cache) were changed for RCU.
> - avc_node objects are now allocated dynamically rather than statically.
>   avc_reclaim_node() reclaims some avc_node objects when their number
>   exceeds AVC_CACHE_THRESHOLD.
> - To update the av_decision structure, I came up with a new RCU primitive,
>   list_replace_rcu(), in include/linux/list.h .
>   It is needed to switch the linked pointers atomically, replacing an old
>   object with a new one. list_replace_rcu() is used in avc_update_node():
>   the original object is duplicated, the copy is updated, and the copy
>   then replaces the original (see the sketch below). We now pay more to
>   update av_decision as the price of lock-free read access, but in my
>   opinion such updates are rare.
> 
> NOTE: The fifth arguments of avc_has_perm() and avc_has_perm_noaudit()
>       are no longer necessary, but the prototype declarations and all the
>       callers are left unchanged, since I want to keep the patches minimal.
> 
> Any comments?
> ----------
> KaiGai <kaigai@ak.jp.nec.com>

The list_replace_rcu() primitive looks good.  At last, a primitive that
reflects the reason for RCU's name!  ;-)

A few comments interspersed for the larger patch (look for empty lines).
I am missing why some of the list insertions and deletions are SMP safe.

							Thanx, Paul

> diff -rNU4 linux-2.6.8.1/security/selinux/avc.c linux-2.6.8.1.rcu/security/selinux/avc.c
> --- linux-2.6.8.1/security/selinux/avc.c	2004-08-14 19:55:48.000000000 +0900
> +++ linux-2.6.8.1.rcu/security/selinux/avc.c	2004-08-20 20:38:06.000000000 +0900
> @@ -35,28 +35,31 @@
>  #include "av_perm_to_string.h"
>  #include "objsec.h"
>  
>  #define AVC_CACHE_SLOTS		512
> -#define AVC_CACHE_MAXNODES	410
> +#define AVC_CACHE_THRESHOLD	410
> +#define AVC_CACHE_RECLAIM	16
>  
>  struct avc_entry {
>  	u32			ssid;
>  	u32			tsid;
>  	u16			tclass;
>  	struct av_decision	avd;
> -	int			used;	/* used recently */
> +	atomic_t		used;	/* used recently */
>  };
>  
>  struct avc_node {
> +	struct list_head	list;
> +	struct rcu_head		rhead;
>  	struct avc_entry	ae;
> -	struct avc_node		*next;
>  };
>  
>  struct avc_cache {
> -	struct avc_node	*slots[AVC_CACHE_SLOTS];
> -	u32		lru_hint;	/* LRU hint for reclaim scan */
> -	u32		active_nodes;
> -	u32		latest_notif;	/* latest revocation notification */
> +	struct list_head	slots[AVC_CACHE_SLOTS];
> +	spinlock_t		slots_lock[AVC_CACHE_SLOTS];	/* Lock for RCU update */
> +	atomic_t		lru_hint;	/* LRU hint for reclaim scan */
> +	atomic_t		active_nodes;
> +	u32			latest_notif;	/* latest revocation notification */
>  };
>  
>  struct avc_callback_node {
>  	int (*callback) (u32 event, u32 ssid, u32 tsid,
> @@ -69,10 +72,8 @@
>  	u32 perms;
>  	struct avc_callback_node *next;
>  };
>  
> -static spinlock_t avc_lock = SPIN_LOCK_UNLOCKED;
> -static struct avc_node *avc_node_freelist;
>  static struct avc_cache avc_cache;
>  static unsigned avc_cache_stats[AVC_NSTATS];
>  static struct avc_callback_node *avc_callbacks;
>  
> @@ -187,52 +188,44 @@
>   * Initialize the access vector cache.
>   */
>  void __init avc_init(void)
>  {
> -	struct avc_node	*new;
>  	int i;
>  
> -	for (i = 0; i < AVC_CACHE_MAXNODES; i++) {
> -		new = kmalloc(sizeof(*new), GFP_ATOMIC);
> -		if (!new) {
> -			printk(KERN_WARNING "avc:  only able to allocate "
> -			       "%d entries\n", i);
> -			break;
> -		}
> -		memset(new, 0, sizeof(*new));
> -		new->next = avc_node_freelist;
> -		avc_node_freelist = new;
> -	}
> -
> +	for (i =0; i < AVC_CACHE_SLOTS; i++) {
> +		INIT_LIST_HEAD(&avc_cache.slots[i]);
> +		avc_cache.slots_lock[i] = SPIN_LOCK_UNLOCKED;
> +	}
> +	atomic_set(&avc_cache.active_nodes, 0);
> +	atomic_set(&avc_cache.lru_hint, 0);
> +	
>  	audit_log(current->audit_context, "AVC INITIALIZED\n");
>  }
>  
>  #if 0
>  static void avc_hash_eval(char *tag)
>  {
>  	int i, chain_len, max_chain_len, slots_used;
>  	struct avc_node *node;
> +	struct list_head *pos;
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&avc_lock,flags);
> +	rcu_read_lock();
>  
>  	slots_used = 0;
>  	max_chain_len = 0;
>  	for (i = 0; i < AVC_CACHE_SLOTS; i++) {
> -		node = avc_cache.slots[i];
> -		if (node) {
> +		if (!list_empty(&avc_cache.slots[i])) {
>  			slots_used++;
>  			chain_len = 0;
> -			while (node) {
> +			list_for_each_rcu(pos, &avc_cache.slots[i])
>  				chain_len++;
> -				node = node->next;
> -			}
>  			if (chain_len > max_chain_len)
>  				max_chain_len = chain_len;
>  		}
>  	}
>  
> -	spin_unlock_irqrestore(&avc_lock,flags);
> +	rcu_read_unlock();
>  
>  	printk(KERN_INFO "\n");
>  	printk(KERN_INFO "%s avc:  %d entries and %d/%d buckets used, longest "
>  	       "chain length %d\n", tag, avc_cache.active_nodes, slots_used,
> @@ -242,188 +235,194 @@
>  static inline void avc_hash_eval(char *tag)
>  { }
>  #endif
>  
> -static inline struct avc_node *avc_reclaim_node(void)
> -{
> -	struct avc_node *prev, *cur;
> -	int hvalue, try;
> +static void avc_node_free(struct rcu_head *rhead) {
> +	struct avc_node *node;
> +	node = container_of(rhead, struct avc_node, rhead);
> +	kfree(node);
> +}
>  
> -	hvalue = avc_cache.lru_hint;
> -	for (try = 0; try < 2; try++) {
> -		do {
> -			prev = NULL;
> -			cur = avc_cache.slots[hvalue];
> -			while (cur) {
> -				if (!cur->ae.used)
> -					goto found;
> +static inline void avc_reclaim_node(void)
> +{
> +	struct avc_node *node;
> +	struct list_head *pos;
> +	int hvalue, try, ecx;
>  
> -				cur->ae.used = 0;
> +	for (try = 0, ecx = 0; try < AVC_CACHE_SLOTS; try++ ) {
> +		hvalue = atomic_inc_return(&avc_cache.lru_hint) % AVC_CACHE_SLOTS;
>  
> -				prev = cur;
> -				cur = cur->next;
> +		if (spin_trylock(&avc_cache.slots_lock[hvalue])) {
> +			list_for_each_rcu(pos, &avc_cache.slots[hvalue]) {

Since we are holding the lock, the list cannot change, and the _rcu()
is not needed.  Right?

> +				node = list_entry(pos, struct avc_node, list);

Why not just do:

		list_for_each_entry(node, &avc_cache.slots[hvalue], list) {

rather than having the separate list_entry().

> +				if (!atomic_dec_and_test(&node->ae.used)) {
> +					/* Recently Unused */
> +					list_del_rcu(&node->list);
> +					call_rcu(&node->rhead, avc_node_free);
> +					atomic_dec(&avc_cache.active_nodes);
> +					ecx++;
> +					if (ecx >= AVC_CACHE_RECLAIM) {
> +						spin_unlock(&avc_cache.slots_lock[hvalue]);
> +						goto out;
> +					}
> +				}
>  			}
> -			hvalue = (hvalue + 1) & (AVC_CACHE_SLOTS - 1);
> -		} while (hvalue != avc_cache.lru_hint);
> +			spin_unlock(&avc_cache.slots_lock[hvalue]);
> +		}
>  	}
> -
> -	panic("avc_reclaim_node");
> -
> -found:
> -	avc_cache.lru_hint = hvalue;
> -
> -	if (prev == NULL)
> -		avc_cache.slots[hvalue] = cur->next;
> -	else
> -		prev->next = cur->next;
> -
> -	return cur;
> +out:
> +	return;
>  }
>  
> -static inline struct avc_node *avc_claim_node(u32 ssid,
> -                                              u32 tsid, u16 tclass)
> +static inline struct avc_node *avc_get_node(void)
>  {
>  	struct avc_node *new;
> -	int hvalue;
> +	int actives;
>  
> -	hvalue = avc_hash(ssid, tsid, tclass);
> -	if (avc_node_freelist) {
> -		new = avc_node_freelist;
> -		avc_node_freelist = avc_node_freelist->next;
> -		avc_cache.active_nodes++;
> -	} else {
> -		new = avc_reclaim_node();
> -		if (!new)
> -			goto out;
> +	new = kmalloc(sizeof(struct avc_node), GFP_ATOMIC);
> +	if (new) {
> +		actives = atomic_inc_return(&avc_cache.active_nodes);
> +		if (actives > AVC_CACHE_THRESHOLD)
> +			avc_reclaim_node();
>  	}
>  
> -	new->ae.used = 1;
> -	new->ae.ssid = ssid;
> -	new->ae.tsid = tsid;
> -	new->ae.tclass = tclass;
> -	new->next = avc_cache.slots[hvalue];
> -	avc_cache.slots[hvalue] = new;
> -
> -out:
>  	return new;
>  }
>  
>  static inline struct avc_node *avc_search_node(u32 ssid, u32 tsid,
>                                                 u16 tclass, int *probes)
>  {
> -	struct avc_node *cur;
> +	struct list_head *pos;
> +	struct avc_node *node, *ret = NULL;
>  	int hvalue;
>  	int tprobes = 1;
>  
>  	hvalue = avc_hash(ssid, tsid, tclass);
> -	cur = avc_cache.slots[hvalue];
> -	while (cur != NULL &&
> -	       (ssid != cur->ae.ssid ||
> -		tclass != cur->ae.tclass ||
> -		tsid != cur->ae.tsid)) {
> +	list_for_each_rcu(pos, &avc_cache.slots[hvalue]) {
> +		node = list_entry(pos, struct avc_node, list);

Why not just do:

		list_for_each_entry_rcu(node, &avc_cache.slots[hvalue], list) {

rather than having the separate list_entry().

> +		if (ssid == node->ae.ssid &&
> +		    tclass == node->ae.tclass &&
> +		    tsid == node->ae.tsid) {
> +			ret = node;
> +			break;
> +		}
>  		tprobes++;
> -		cur = cur->next;
>  	}
>  
> -	if (cur == NULL) {
> +	if (ret == NULL) {
>  		/* cache miss */
>  		goto out;
>  	}
>  
>  	/* cache hit */
>  	if (probes)
>  		*probes = tprobes;
> -
> -	cur->ae.used = 1;
> -
> +	if (atomic_read(&ret->ae.used) != 1)
> +		atomic_set(&ret->ae.used, 1);
>  out:
> -	return cur;
> +	return ret;
>  }
>  
>  /**
>   * avc_lookup - Look up an AVC entry.
>   * @ssid: source security identifier
>   * @tsid: target security identifier
>   * @tclass: target security class
>   * @requested: requested permissions, interpreted based on @tclass
> - * @aeref:  AVC entry reference
>   *
>   * Look up an AVC entry that is valid for the
>   * @requested permissions between the SID pair
>   * (@ssid, @tsid), interpreting the permissions
>   * based on @tclass.  If a valid AVC entry exists,
> - * then this function updates @aeref to refer to the
> - * entry and returns %0. Otherwise, this function
> - * returns -%ENOENT.
> + * then this function return the avc_node.
> + * Otherwise, this function returns NULL.
>   */
> -int avc_lookup(u32 ssid, u32 tsid, u16 tclass,
> -               u32 requested, struct avc_entry_ref *aeref)
> +struct avc_node *avc_lookup(u32 ssid, u32 tsid, u16 tclass, u32 requested)
>  {
>  	struct avc_node *node;
> -	int probes, rc = 0;
> +	int probes;
>  
>  	avc_cache_stats_incr(AVC_CAV_LOOKUPS);
>  	node = avc_search_node(ssid, tsid, tclass,&probes);
>  
>  	if (node && ((node->ae.avd.decided & requested) == requested)) {
>  		avc_cache_stats_incr(AVC_CAV_HITS);
>  		avc_cache_stats_add(AVC_CAV_PROBES,probes);
> -		aeref->ae = &node->ae;
>  		goto out;
>  	}
>  
>  	avc_cache_stats_incr(AVC_CAV_MISSES);
> -	rc = -ENOENT;
>  out:
> -	return rc;
> +	return node;
> +}
> +
> +static int avc_latest_notif_update(int seqno, int is_insert)
> +{
> +	int ret = 0;
> +	static spinlock_t notif_lock = SPIN_LOCK_UNLOCKED;
> +
> +	spin_lock(&notif_lock);
> +	if (seqno < avc_cache.latest_notif) {
> +		if (is_insert) {
> +			printk(KERN_WARNING "avc:  seqno %d < latest_notif %d\n",
> +			       seqno, avc_cache.latest_notif);
> +			ret = -EAGAIN;
> +		} else {
> +			avc_cache.latest_notif = seqno;
> +		}
> +	}
> +	spin_unlock(&notif_lock);
> +	return ret;
>  }
>  
>  /**
>   * avc_insert - Insert an AVC entry.
>   * @ssid: source security identifier
>   * @tsid: target security identifier
>   * @tclass: target security class
>   * @ae: AVC entry
> - * @aeref:  AVC entry reference
>   *
>   * Insert an AVC entry for the SID pair
>   * (@ssid, @tsid) and class @tclass.
>   * The access vectors and the sequence number are
>   * normally provided by the security server in
>   * response to a security_compute_av() call.  If the
>   * sequence number @ae->avd.seqno is not less than the latest
>   * revocation notification, then the function copies
> - * the access vectors into a cache entry, updates
> - * @aeref to refer to the entry, and returns %0.
> - * Otherwise, this function returns -%EAGAIN.
> + * the access vectors into a cache entry, returns
> + * avc_node inserted. Otherwise, this function returns NULL.
>   */
> -int avc_insert(u32 ssid, u32 tsid, u16 tclass,
> -               struct avc_entry *ae, struct avc_entry_ref *aeref)
> +struct avc_node *avc_insert(u32 ssid, u32 tsid, u16 tclass, struct avc_entry *ae)
>  {
> -	struct avc_node *node;
> -	int rc = 0;
> +	struct avc_node *node = NULL;
> +	int hvalue;
>  
> -	if (ae->avd.seqno < avc_cache.latest_notif) {
> -		printk(KERN_WARNING "avc:  seqno %d < latest_notif %d\n",
> -		       ae->avd.seqno, avc_cache.latest_notif);
> -		rc = -EAGAIN;
> +	if (avc_latest_notif_update(ae->avd.seqno, 1))
>  		goto out;
> -	}
>  
> -	node = avc_claim_node(ssid, tsid, tclass);
> -	if (!node) {
> -		rc = -ENOMEM;
> -		goto out;
> -	}
> +	node = avc_get_node();
> +
> +	if (node) {
> +		hvalue = avc_hash(ssid, tsid, tclass);
> +
> +		INIT_LIST_HEAD(&node->list);
> +		INIT_RCU_HEAD(&node->rhead);
>  
> -	node->ae.avd.allowed = ae->avd.allowed;
> -	node->ae.avd.decided = ae->avd.decided;
> -	node->ae.avd.auditallow = ae->avd.auditallow;
> -	node->ae.avd.auditdeny = ae->avd.auditdeny;
> -	node->ae.avd.seqno = ae->avd.seqno;
> -	aeref->ae = &node->ae;
> +		node->ae.ssid = ssid;
> +		node->ae.tsid = tsid;
> +		node->ae.tclass = tclass;
> +		atomic_set(&node->ae.used, 1);
> +
> +		node->ae.avd.allowed = ae->avd.allowed;
> +		node->ae.avd.decided = ae->avd.decided;
> +		node->ae.avd.auditallow = ae->avd.auditallow;
> +		node->ae.avd.auditdeny = ae->avd.auditdeny;
> +		node->ae.avd.seqno = ae->avd.seqno;
> +
> +		list_add_rcu(&node->list, &avc_cache.slots[hvalue]);
> +	}
>  out:
> -	return rc;
> +	return node;
>  }
>  
>  static inline void avc_print_ipv6_addr(struct audit_buffer *ab,
>  				       struct in6_addr *addr, u16 port,
> @@ -685,63 +684,99 @@
>  {
>  	return (x == y || x == SECSID_WILD || y == SECSID_WILD);
>  }
>  
> -static inline void avc_update_node(u32 event, struct avc_node *node, u32 perms)
> +static void avc_update_node(u32 event, u32 perms, u32 ssid, u32 tsid, u16 tclass)
>  {
> -	switch (event) {
> -	case AVC_CALLBACK_GRANT:
> -		node->ae.avd.allowed |= perms;
> -		break;
> -	case AVC_CALLBACK_TRY_REVOKE:
> -	case AVC_CALLBACK_REVOKE:
> -		node->ae.avd.allowed &= ~perms;
> -		break;
> -	case AVC_CALLBACK_AUDITALLOW_ENABLE:
> -		node->ae.avd.auditallow |= perms;
> -		break;
> -	case AVC_CALLBACK_AUDITALLOW_DISABLE:
> -		node->ae.avd.auditallow &= ~perms;
> -		break;
> -	case AVC_CALLBACK_AUDITDENY_ENABLE:
> -		node->ae.avd.auditdeny |= perms;
> -		break;
> -	case AVC_CALLBACK_AUDITDENY_DISABLE:
> -		node->ae.avd.auditdeny &= ~perms;
> -		break;
> +	int hvalue;
> +	struct list_head *pos;
> +	struct avc_node *new, *node, *org = NULL;
> +
> +	new = kmalloc(sizeof(struct avc_node), GFP_ATOMIC);
> +	hvalue = avc_hash(ssid, tsid, tclass);
> +
> +	/* Lock the target slot */
> +	spin_lock(&avc_cache.slots_lock[hvalue]);
> +	list_for_each(pos, &avc_cache.slots[hvalue]) {
> +		node = list_entry(pos, struct avc_node, list);

Why not just do:

		list_for_each_entry_rcu(node, &avc_cache.slots[hvalue], list) {

rather than having the separate list_entry().

> +		if (ssid==node->ae.ssid &&
> +		    tsid==node->ae.tsid &&
> +		    tclass==node->ae.tclass ) {
> +			org = node;
> +			break;
> +		}
> +	}
> +
> +	if (org) {
> +		if (!new) {

If we run out of memory, we silently delete the node that we were
wanting to update?  Is this really what you want?

OK, OK, given that the "C" in "AVC" stands for "cache", I guess that
deleting the element does indeed make sense...

> +			list_del_rcu(&org->list);
> +			call_rcu(&org->rhead, avc_node_free);
> +			goto out;
> +		}
> +
> +		/* Duplicate and Update */
> +		memcpy(new, org, sizeof(struct avc_node));
> +
> +		switch (event) {
> +		case AVC_CALLBACK_GRANT:
> +			new->ae.avd.allowed |= perms;

This is a 32-bit field, and is protected by a lock.  Therefore, only
one update should be happening at a time.  This field is 32-bit aligned
(right?), and so the write that does the update will appear atomic to
	readers.  Only one of the updates happens in a given call.

So it should be safe to make the updates in place, right?

Some possible reasons why it might -not- be safe to do this:

o	There is some bizarre CPU out there that does not do
	atomic 32-bit writes (but this would break all sorts of
	stuff).

o	Consecutive changes might give a slow reader the incorrect
	impression that permissions are wide open.  One way to take
	care of this would be to force a grace period between each
	change.  One way to do this would be to set a bit indicating
	an update, and have the call_rcu() clear it.  When getting
	ready to update, if the bit is still set, block until the
	bit is cleared.  This would guarantee that a given reader
	would see at most one update.

	But this would require considerable code restructuring, since
	this code is still under an rcu_read_lock().  But putting
	forward the thought anyway in case it is helpful.

	If updates are -really- rare, a simple way to make this happen
	would be to protect the updates with a sema_t, and just do
	an unconditional synchronize_kernel() just before releasing
	the sema_t.

	Since the current code permits readers to see changes to
	several different nodes, a single synchronize_kernel() should
	suffice.
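
	For concreteness, a minimal sketch of that last variant, assuming
	the update path has been restructured to run outside of any
	rcu_read_lock() section (DECLARE_MUTEX stands in for the sema_t;
	the name avc_update_sem is made up):

		static DECLARE_MUTEX(avc_update_sem);

		down(&avc_update_sem);
		/* copy, modify, list_replace_rcu(), call_rcu() */
		synchronize_kernel();	/* one grace period per update */
		up(&avc_update_sem);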

> +			break;
> +		case AVC_CALLBACK_TRY_REVOKE:
> +		case AVC_CALLBACK_REVOKE:
> +			new->ae.avd.allowed &= ~perms;
> +			break;
> +		case AVC_CALLBACK_AUDITALLOW_ENABLE:
> +			new->ae.avd.auditallow |= perms;
> +			break;
> +		case AVC_CALLBACK_AUDITALLOW_DISABLE:
> +			new->ae.avd.auditallow &= ~perms;
> +			break;
> +		case AVC_CALLBACK_AUDITDENY_ENABLE:
> +			new->ae.avd.auditdeny |= perms;
> +			break;
> +		case AVC_CALLBACK_AUDITDENY_DISABLE:
> +			new->ae.avd.auditdeny &= ~perms;
> +			break;
> +		}
> +
> +		list_replace_rcu(&org->list,&new->list);
> +		call_rcu(&org->rhead, avc_node_free);
>  	}
> +out:
> +	spin_unlock(&avc_cache.slots_lock[hvalue]);
>  }
>  
>  static int avc_update_cache(u32 event, u32 ssid, u32 tsid,
>                              u16 tclass, u32 perms)
>  {
> +	struct list_head *pos;
>  	struct avc_node *node;
>  	int i;
> -	unsigned long flags;
>  
> -	spin_lock_irqsave(&avc_lock,flags);
> +	rcu_read_lock();
>  
>  	if (ssid == SECSID_WILD || tsid == SECSID_WILD) {
>  		/* apply to all matching nodes */
>  		for (i = 0; i < AVC_CACHE_SLOTS; i++) {
> -			for (node = avc_cache.slots[i]; node;
> -			     node = node->next) {
> +			list_for_each_rcu(pos, &avc_cache.slots[i] ) {
> +				node = list_entry(pos, struct avc_node, list);

Why not just do:

		list_for_each_entry_rcu(node, &avc_cache.slots[i], list) {

rather than having the separate list_entry().

>  				if (avc_sidcmp(ssid, node->ae.ssid) &&
>  				    avc_sidcmp(tsid, node->ae.tsid) &&
> -				    tclass == node->ae.tclass) {
> -					avc_update_node(event,node,perms);
> +				    tclass == node->ae.tclass ) {
> +					avc_update_node(event, perms, node->ae.ssid
> +					                ,node->ae.tsid, node->ae.tclass);
>  				}
>  			}
>  		}
>  	} else {
>  		/* apply to one node */
>  		node = avc_search_node(ssid, tsid, tclass, NULL);
>  		if (node) {
> -			avc_update_node(event,node,perms);
> +			avc_update_node(event, perms, ssid, tsid, tclass);
>  		}
>  	}
>  
> -	spin_unlock_irqrestore(&avc_lock,flags);
> +	rcu_read_unlock();
>  
>  	return 0;
>  }
>  
> @@ -751,9 +786,8 @@
>  {
>  	struct avc_callback_node *c;
>  	u32 tretained = 0, cretained = 0;
>  	int rc = 0;
> -	unsigned long flags;
>  
>  	/*
>  	 * try_revoke only removes permissions from the cache
>  	 * state if they are not retained by the object manager.
> @@ -786,12 +820,9 @@
>  		avc_update_cache(event,ssid,tsid,tclass,perms);
>  		*out_retained = tretained;
>  	}
>  
> -	spin_lock_irqsave(&avc_lock,flags);
> -	if (seqno > avc_cache.latest_notif)
> -		avc_cache.latest_notif = seqno;
> -	spin_unlock_irqrestore(&avc_lock,flags);
> +	avc_latest_notif_update(seqno, 0);
>  
>  out:
>  	return rc;
>  }
> @@ -856,34 +887,24 @@
>  int avc_ss_reset(u32 seqno)
>  {
>  	struct avc_callback_node *c;
>  	int i, rc = 0;
> -	struct avc_node *node, *tmp;
> -	unsigned long flags;
> +	struct list_head *pos;
> +	struct avc_node *node;
>  
>  	avc_hash_eval("reset");
>  
> -	spin_lock_irqsave(&avc_lock,flags);
> +	rcu_read_lock();
>  
>  	for (i = 0; i < AVC_CACHE_SLOTS; i++) {
> -		node = avc_cache.slots[i];
> -		while (node) {
> -			tmp = node;
> -			node = node->next;
> -			tmp->ae.ssid = tmp->ae.tsid = SECSID_NULL;
> -			tmp->ae.tclass = SECCLASS_NULL;
> -			tmp->ae.avd.allowed = tmp->ae.avd.decided = 0;
> -			tmp->ae.avd.auditallow = tmp->ae.avd.auditdeny = 0;
> -			tmp->ae.used = 0;
> -			tmp->next = avc_node_freelist;
> -			avc_node_freelist = tmp;
> -			avc_cache.active_nodes--;
> +		list_for_each_rcu(pos, &avc_cache.slots[i]) {
> +			node = list_entry(pos, struct avc_node, list);

Why not just do:

		list_for_each_entry_rcu(node, &avc_cache.slots[i], list) {

rather than having the separate list_entry().

> +			list_del_rcu(&node->list);

I don't see what prevents multiple list_del_rcu()s from executing in
parallel, mangling the list.  Is there some higher-level lock?
If so, why do we need RCU protecting the list?

> +			call_rcu(&node->rhead, avc_node_free);
>  		}
> -		avc_cache.slots[i] = NULL;
>  	}
> -	avc_cache.lru_hint = 0;
>  
> -	spin_unlock_irqrestore(&avc_lock,flags);
> +	rcu_read_unlock();
>  
>  	for (i = 0; i < AVC_NSTATS; i++)
>  		avc_cache_stats[i] = 0;
>  
> @@ -895,12 +916,9 @@
>  				goto out;
>  		}
>  	}
>  
> -	spin_lock_irqsave(&avc_lock,flags);
> -	if (seqno > avc_cache.latest_notif)
> -		avc_cache.latest_notif = seqno;
> -	spin_unlock_irqrestore(&avc_lock,flags);
> +	avc_latest_notif_update(seqno, 0);
>  out:
>  	return rc;
>  }
>  
> @@ -949,9 +967,9 @@
>   * @ssid: source security identifier
>   * @tsid: target security identifier
>   * @tclass: target security class
>   * @requested: requested permissions, interpreted based on @tclass
> - * @aeref:  AVC entry reference
> + * @aeref:  AVC entry reference (not in use)
>   * @avd: access vector decisions
>   *
>   * Check the AVC to determine whether the @requested permissions are granted
>   * for the SID pair (@ssid, @tsid), interpreting the permissions
> @@ -968,72 +986,45 @@
>  int avc_has_perm_noaudit(u32 ssid, u32 tsid,
>                           u16 tclass, u32 requested,
>                           struct avc_entry_ref *aeref, struct av_decision *avd)
>  {
> -	struct avc_entry *ae;
> -	int rc = 0;
> -	unsigned long flags;
> +	struct avc_node *node;
>  	struct avc_entry entry;
> +	int rc = 0;
>  	u32 denied;
> -	struct avc_entry_ref ref;
>  
> -	if (!aeref) {
> -		avc_entry_ref_init(&ref);
> -		aeref = &ref;
> -	}
> -
> -	spin_lock_irqsave(&avc_lock, flags);
> +	rcu_read_lock();
>  	avc_cache_stats_incr(AVC_ENTRY_LOOKUPS);
> -	ae = aeref->ae;
> -	if (ae) {
> -		if (ae->ssid == ssid &&
> -		    ae->tsid == tsid &&
> -		    ae->tclass == tclass &&
> -		    ((ae->avd.decided & requested) == requested)) {
> -			avc_cache_stats_incr(AVC_ENTRY_HITS);
> -			ae->used = 1;
> -		} else {
> -			avc_cache_stats_incr(AVC_ENTRY_DISCARDS);
> -			ae = NULL;
> -		}
> -	}
>  
> -	if (!ae) {
> -		avc_cache_stats_incr(AVC_ENTRY_MISSES);
> -		rc = avc_lookup(ssid, tsid, tclass, requested, aeref);
> -		if (rc) {
> -			spin_unlock_irqrestore(&avc_lock,flags);
> -			rc = security_compute_av(ssid,tsid,tclass,requested,&entry.avd);
> -			if (rc)
> -				goto out;
> -			spin_lock_irqsave(&avc_lock, flags);
> -			rc = avc_insert(ssid,tsid,tclass,&entry,aeref);
> -			if (rc) {
> -				spin_unlock_irqrestore(&avc_lock,flags);
> -				goto out;
> -			}
> +	node = avc_lookup(ssid, tsid, tclass, requested);
> +	if (!node) {
> +		rcu_read_unlock();
> +		rc = security_compute_av(ssid,tsid,tclass,requested,&entry.avd);
> +		if (rc)
> +			goto out;
> +		rcu_read_lock();
> +		node = avc_insert(ssid,tsid,tclass,&entry);

I don't see what prevents two copies of avc_insert from running in parallel.
Seems like this would trash the lists.  Or, if there is a higher-level lock
that I am missing, why do we need RCU protecting the lists?

> +		if (!node) {
> +			rcu_read_unlock();
> +			goto out;
>  		}
> -		ae = aeref->ae;
>  	}
> -
>  	if (avd)
> -		memcpy(avd, &ae->avd, sizeof(*avd));
> +		memcpy(avd, &node->ae.avd, sizeof(*avd));
>  
> -	denied = requested & ~(ae->avd.allowed);
> +	denied = requested & ~(node->ae.avd.allowed);
>  
>  	if (!requested || denied) {
>  		if (selinux_enforcing) {
> -			spin_unlock_irqrestore(&avc_lock,flags);
>  			rc = -EACCES;
> -			goto out;
>  		} else {
> -			ae->avd.allowed |= requested;
> -			spin_unlock_irqrestore(&avc_lock,flags);
> -			goto out;
> +			if (node->ae.avd.allowed != (node->ae.avd.allowed|requested))
> +				avc_update_node(AVC_CALLBACK_GRANT
> +				                ,requested,ssid,tsid,tclass);
>  		}
>  	}
>  
> -	spin_unlock_irqrestore(&avc_lock,flags);
> +	rcu_read_unlock();
>  out:
>  	return rc;
>  }
>  
> @@ -1042,9 +1033,9 @@
>   * @ssid: source security identifier
>   * @tsid: target security identifier
>   * @tclass: target security class
>   * @requested: requested permissions, interpreted based on @tclass
> - * @aeref:  AVC entry reference
> + * @aeref:  AVC entry reference (not in use)
>   * @auditdata: auxiliary audit data
>   *
>   * Check the AVC to determine whether the @requested permissions are granted
>   * for the SID pair (@ssid, @tsid), interpreting the permissions
> @@ -1061,8 +1052,8 @@
>  {
>  	struct av_decision avd;
>  	int rc;
>  
> -	rc = avc_has_perm_noaudit(ssid, tsid, tclass, requested, aeref, &avd);
> +	rc = avc_has_perm_noaudit(ssid, tsid, tclass, requested, NULL, &avd);
>  	avc_audit(ssid, tsid, tclass, requested, &avd, rc, auditdata);
>  	return rc;
>  }
> diff -rNU4 linux-2.6.8.1/security/selinux/include/avc.h linux-2.6.8.1.rcu/security/selinux/include/avc.h
> --- linux-2.6.8.1/security/selinux/include/avc.h	2004-08-14 19:54:51.000000000 +0900
> +++ linux-2.6.8.1.rcu/security/selinux/include/avc.h	2004-08-20 20:40:50.000000000 +0900
> @@ -118,13 +118,11 @@
>   */
>  
>  void __init avc_init(void);
>  
> -int avc_lookup(u32 ssid, u32 tsid, u16 tclass,
> -               u32 requested, struct avc_entry_ref *aeref);
> +struct avc_node *avc_lookup(u32 ssid, u32 tsid, u16 tclass, u32 requested);
>  
> -int avc_insert(u32 ssid, u32 tsid, u16 tclass,
> -               struct avc_entry *ae, struct avc_entry_ref *out_aeref);
> +struct avc_node *avc_insert(u32 ssid, u32 tsid, u16 tclass, struct avc_entry *ae);
>  
>  void avc_audit(u32 ssid, u32 tsid,
>                 u16 tclass, u32 requested,
>                 struct av_decision *avd, int result, struct avc_audit_data *auditdata);

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-20 20:19     ` Paul E. McKenney
@ 2004-08-20 20:35       ` James Morris
  2004-08-24  7:27       ` Kaigai Kohei
  1 sibling, 0 replies; 31+ messages in thread
From: James Morris @ 2004-08-20 20:35 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Kaigai Kohei, Stephen Smalley, SELinux-ML(Eng), Linux Kernel ML(Eng)


> > +		hvalue = atomic_inc_return(&avc_cache.lru_hint) % AVC_CACHE_SLOTS;

atomic_inc_return() is not implemented on ia32 or x86-64.  Is there a 
workaround, or do we need to implement it?  (Andi Kleen suggested using 
the xadd instruction and altinstructions for i386).
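
For illustration only, a minimal sketch of the xadd approach (assuming a
486-or-later CPU and the usual SMP LOCK-prefix macro; the altinstructions
fallback for the 386 is omitted, and this is an untested sketch, not a
proposed implementation):

	static inline int atomic_inc_return(atomic_t *v)
	{
		int i = 1;

		/* xaddl atomically adds %0 into the memory operand and
		 * returns the previous value of the memory operand in %0. */
		__asm__ __volatile__(LOCK "xaddl %0, %1"
				     : "+r" (i), "+m" (v->counter)
				     : : "memory");
		return i + 1;	/* previous value plus our increment */
	}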


- James
-- 
James Morris
<jmorris@redhat.com>



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RCU issue with SELinux (Re: SELINUX performance issues)
       [not found]     ` <1093014789.16585.186.camel@moss-spartans.epoch.ncsc.mil>
@ 2004-08-24  7:25       ` Kaigai Kohei
  2004-08-24 15:37         ` Stephen Smalley
  2004-08-24 23:02         ` RCU issue with SELinux (Re: SELINUX performance issues) Paul E. McKenney
  0 siblings, 2 replies; 31+ messages in thread
From: Kaigai Kohei @ 2004-08-24  7:25 UTC (permalink / raw)
  To: Stephen Smalley; +Cc: SELinux-ML(Eng), Linux Kernel ML(Eng), James Morris

[-- Attachment #1: Type: text/plain, Size: 2027 bytes --]

Hi Stephen, Thanks for your comments.

> I'm not overly familiar with RCU myself, but the comments in list.h for
> list_add_rcu suggest that you still need to hold a lock to avoid racing
> with another list_add_rcu or list_del_rcu call on the same list.  But
> avc_insert is calling list_add_rcu without holding any lock; can't it
> race with another avc_insert on the same hash bucket?  Do I just
> misunderstand, or is this unsafe?  Thanks for clarifying.

You are right. Indeed, the lock for the hash bucket is also necessary
when avc_insert() is called. I have fixed this.

> I think we can likely eliminate the mutation of the node in the
> !selinux_enforcing case in avc_has_perm_noaudit, i.e. eliminate the
> entire else clause and just fall through with rc still 0.  Adding the
> requested permissions to the node was simply to avoid flooding denials
> in permissive mode on the same permission check, but this can be
> addressed separately using the audit ratelimit mechanism.

I have a different opinion.
This simple mechanism against the flood of audit logs is necessary,
because it prevents the system log buffer from being exhausted and the
console from being covered with denial messages while we are debugging
the security policy in permissive mode.
So I improved the avc_update_node() function and the avc_node data structure;
avc_update_node() no longer needs kmalloc().

This approach is good for the durability and compatibility of the original
implementation, but updating without kmalloc() requires double the memory
for avc_nodes. The approach can be applied to any kind of avc_entry update,
though the idea is rather more complex.

I modified the following points:
- We hold the lock for the hash bucket when avc_insert() and avc_ss_reset()
  are called, for safety.
- list_for_each_rcu() and list_entry() are replaced by list_for_each_entry().
- An avc_node_dual structure which contains two avc_node objects is defined.
  It allows avc_update_node() to work without kmalloc() or any additional
  locks (a brief sketch follows below).
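
The core of the update path then looks roughly like this (a distilled
sketch of the attached take2 patch; variable names loosely follow the
patch, and the reserved/container pointer bookkeeping and the full
event switch are elided):

	/* 'org' is the live entry found under the bucket lock; its
	 * pre-allocated twin is used instead of calling kmalloc(). */
	struct avc_node *new = org->reserved;

	memcpy(&new->ae, &org->ae, sizeof(new->ae)); /* duplicate the entry */
	new->ae.avd.allowed |= perms;                /* update the copy     */
	list_replace_rcu(&org->list, &new->list);    /* publish atomically: */
	                                             /* readers see the old
	                                              * or the new entry,
	                                              * never a mixture    */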

Any comments are welcome. Thanks.
--------
Kai Gai <kaigai@ak.jp.nec.com>

[-- Attachment #2: list_replace_rcu-2.6.8.1.patch --]
[-- Type: application/octet-stream, Size: 845 bytes --]

--- linux-2.6.8.1/include/linux/list.h	2004-08-14 19:55:33.000000000 +0900
+++ linux-2.6.8.1.rcu/include/linux/list.h	2004-08-20 18:04:10.000000000 +0900
@@ -194,8 +194,23 @@
 	__list_del(entry->prev, entry->next);
 	entry->prev = LIST_POISON2;
 }
 
+/*
+ * list_replace_rcu - replace old entry with new one in list
+ * @old : the element to be replaced from the list.
+ * @new : the new element to insert into the list.
+ *
+ * The old entry will be replaced by the new entry atomically.
+ */
+static inline void list_replace_rcu(struct list_head *old, struct list_head *new){
+	new->next = old->next;
+	new->prev = old->prev;
+	smp_wmb();
+	new->next->prev = new;
+	new->prev->next = new;
+}
+
 /**
  * list_del_init - deletes entry from list and reinitialize it.
  * @entry: the element to delete from the list.
  */

[-- Attachment #3: selinux.rcu-2.6.8.1-take2.patch --]
[-- Type: application/octet-stream, Size: 20952 bytes --]

diff -rNU4 linux-2.6.8.1/security/selinux/avc.c linux-2.6.8.1.rcu/security/selinux/avc.c
--- linux-2.6.8.1/security/selinux/avc.c	2004-08-14 19:55:48.000000000 +0900
+++ linux-2.6.8.1.rcu/security/selinux/avc.c	2004-08-24 13:30:56.000000000 +0900
@@ -35,28 +35,40 @@
 #include "av_perm_to_string.h"
 #include "objsec.h"
 
 #define AVC_CACHE_SLOTS		512
-#define AVC_CACHE_MAXNODES	410
+#define AVC_CACHE_THRESHOLD	410
+#define AVC_CACHE_RECLAIM	16
 
 struct avc_entry {
 	u32			ssid;
 	u32			tsid;
 	u16			tclass;
 	struct av_decision	avd;
-	int			used;	/* used recently */
+	atomic_t		used;	/* used recently */
 };
 
+struct avc_node_dual;
 struct avc_node {
+	struct list_head	list;
 	struct avc_entry	ae;
-	struct avc_node		*next;
+	struct avc_node		*reserved;
+	struct avc_node_dual	*container;
+};
+
+/* dual avc_node is necessary for update */
+struct avc_node_dual {
+	struct rcu_head         rhead;
+	struct avc_node         node[2];
 };
 
 struct avc_cache {
-	struct avc_node	*slots[AVC_CACHE_SLOTS];
-	u32		lru_hint;	/* LRU hint for reclaim scan */
-	u32		active_nodes;
-	u32		latest_notif;	/* latest revocation notification */
+	struct list_head	slots[AVC_CACHE_SLOTS];
+	spinlock_t		slots_lock[AVC_CACHE_SLOTS];
+		/* lock for insert/update/delete and reset */
+	atomic_t		lru_hint;	/* LRU hint for reclaim scan */
+	atomic_t		active_nodes;
+	u32			latest_notif;	/* latest revocation notification */
 };
 
 struct avc_callback_node {
 	int (*callback) (u32 event, u32 ssid, u32 tsid,
@@ -69,10 +81,8 @@
 	u32 perms;
 	struct avc_callback_node *next;
 };
 
-static spinlock_t avc_lock = SPIN_LOCK_UNLOCKED;
-static struct avc_node *avc_node_freelist;
 static struct avc_cache avc_cache;
 static unsigned avc_cache_stats[AVC_NSTATS];
 static struct avc_callback_node *avc_callbacks;
 
@@ -187,52 +197,44 @@
  * Initialize the access vector cache.
  */
 void __init avc_init(void)
 {
-	struct avc_node	*new;
 	int i;
 
-	for (i = 0; i < AVC_CACHE_MAXNODES; i++) {
-		new = kmalloc(sizeof(*new), GFP_ATOMIC);
-		if (!new) {
-			printk(KERN_WARNING "avc:  only able to allocate "
-			       "%d entries\n", i);
-			break;
-		}
-		memset(new, 0, sizeof(*new));
-		new->next = avc_node_freelist;
-		avc_node_freelist = new;
-	}
-
+	for (i =0; i < AVC_CACHE_SLOTS; i++) {
+		INIT_LIST_HEAD(&avc_cache.slots[i]);
+		avc_cache.slots_lock[i] = SPIN_LOCK_UNLOCKED;
+	}
+	atomic_set(&avc_cache.active_nodes, 0);
+	atomic_set(&avc_cache.lru_hint, 0);
+	
 	audit_log(current->audit_context, "AVC INITIALIZED\n");
 }
 
 #if 0
 static void avc_hash_eval(char *tag)
 {
 	int i, chain_len, max_chain_len, slots_used;
 	struct avc_node *node;
+	struct list_head *pos;
 	unsigned long flags;
 
-	spin_lock_irqsave(&avc_lock,flags);
+	rcu_read_lock();
 
 	slots_used = 0;
 	max_chain_len = 0;
 	for (i = 0; i < AVC_CACHE_SLOTS; i++) {
-		node = avc_cache.slots[i];
-		if (node) {
+		if (!list_empty(&avc_cache.slots[i])) {
 			slots_used++;
 			chain_len = 0;
-			while (node) {
+			list_for_each_rcu(pos, &avc_cache.slots[i])
 				chain_len++;
-				node = node->next;
-			}
 			if (chain_len > max_chain_len)
 				max_chain_len = chain_len;
 		}
 	}
 
-	spin_unlock_irqrestore(&avc_lock,flags);
+	rcu_read_unlock();
 
 	printk(KERN_INFO "\n");
 	printk(KERN_INFO "%s avc:  %d entries and %d/%d buckets used, longest "
 	       "chain length %d\n", tag, avc_cache.active_nodes, slots_used,
@@ -242,188 +244,208 @@
 static inline void avc_hash_eval(char *tag)
 { }
 #endif
 
-static inline struct avc_node *avc_reclaim_node(void)
-{
-	struct avc_node *prev, *cur;
-	int hvalue, try;
+static void avc_node_free(struct rcu_head *rhead) {
+	struct avc_node_dual *dual_node;
+	dual_node = container_of(rhead, struct avc_node_dual, rhead);
+	kfree(dual_node);
+}
 
-	hvalue = avc_cache.lru_hint;
-	for (try = 0; try < 2; try++) {
-		do {
-			prev = NULL;
-			cur = avc_cache.slots[hvalue];
-			while (cur) {
-				if (!cur->ae.used)
-					goto found;
+static inline void avc_reclaim_node(void)
+{
+	struct avc_node *node;
+	int hvalue, try, ecx;
 
-				cur->ae.used = 0;
+	for (try = 0, ecx = 0; try < AVC_CACHE_SLOTS; try++ ) {
+		hvalue = atomic_inc_return(&avc_cache.lru_hint) % AVC_CACHE_SLOTS;
 
-				prev = cur;
-				cur = cur->next;
+		if (spin_trylock(&avc_cache.slots_lock[hvalue])) {
+			list_for_each_entry(node, &avc_cache.slots[hvalue], list) {
+				if (!atomic_dec_and_test(&node->ae.used)) {
+					/* Recently Unused */
+					list_del_rcu(&node->list);
+					call_rcu(&node->container->rhead, avc_node_free);
+					atomic_dec(&avc_cache.active_nodes);
+					ecx++;
+					if (ecx >= AVC_CACHE_RECLAIM) {
+						spin_unlock(&avc_cache.slots_lock[hvalue]);
+						goto out;
+					}
+				}
 			}
-			hvalue = (hvalue + 1) & (AVC_CACHE_SLOTS - 1);
-		} while (hvalue != avc_cache.lru_hint);
+			spin_unlock(&avc_cache.slots_lock[hvalue]);
+		}
 	}
-
-	panic("avc_reclaim_node");
-
-found:
-	avc_cache.lru_hint = hvalue;
-
-	if (prev == NULL)
-		avc_cache.slots[hvalue] = cur->next;
-	else
-		prev->next = cur->next;
-
-	return cur;
+out:
+	return;
 }
 
-static inline struct avc_node *avc_claim_node(u32 ssid,
-                                              u32 tsid, u16 tclass)
+static inline struct avc_node *avc_get_node(void)
 {
-	struct avc_node *new;
-	int hvalue;
+	struct avc_node_dual *new;
+	int actives;
 
-	hvalue = avc_hash(ssid, tsid, tclass);
-	if (avc_node_freelist) {
-		new = avc_node_freelist;
-		avc_node_freelist = avc_node_freelist->next;
-		avc_cache.active_nodes++;
-	} else {
-		new = avc_reclaim_node();
-		if (!new)
-			goto out;
-	}
+	new = kmalloc(sizeof(struct avc_node_dual), GFP_ATOMIC);
+	if (!new)
+		return NULL;
 
-	new->ae.used = 1;
-	new->ae.ssid = ssid;
-	new->ae.tsid = tsid;
-	new->ae.tclass = tclass;
-	new->next = avc_cache.slots[hvalue];
-	avc_cache.slots[hvalue] = new;
+	INIT_RCU_HEAD(&new->rhead);
+	INIT_LIST_HEAD(&new->node[0].list);
+	INIT_LIST_HEAD(&new->node[1].list);
+	new->node[0].container = new;
+	new->node[1].container = new;
+	new->node[0].reserved = &new->node[1];
+	new->node[1].reserved = &new->node[0];
 
-out:
-	return new;
+	actives = atomic_inc_return(&avc_cache.active_nodes);
+	if (actives > AVC_CACHE_THRESHOLD)
+		avc_reclaim_node();
+
+	return &new->node[0];
 }
 
 static inline struct avc_node *avc_search_node(u32 ssid, u32 tsid,
                                                u16 tclass, int *probes)
 {
-	struct avc_node *cur;
+	struct avc_node *node, *ret = NULL;
 	int hvalue;
 	int tprobes = 1;
 
 	hvalue = avc_hash(ssid, tsid, tclass);
-	cur = avc_cache.slots[hvalue];
-	while (cur != NULL &&
-	       (ssid != cur->ae.ssid ||
-		tclass != cur->ae.tclass ||
-		tsid != cur->ae.tsid)) {
+	list_for_each_entry(node, &avc_cache.slots[hvalue], list) {
+		if (ssid == node->ae.ssid &&
+		    tclass == node->ae.tclass &&
+		    tsid == node->ae.tsid) {
+			ret = node;
+			break;
+		}
 		tprobes++;
-		cur = cur->next;
 	}
 
-	if (cur == NULL) {
+	if (ret == NULL) {
 		/* cache miss */
 		goto out;
 	}
 
 	/* cache hit */
 	if (probes)
 		*probes = tprobes;
-
-	cur->ae.used = 1;
-
+	if (atomic_read(&ret->ae.used) != 1)
+		atomic_set(&ret->ae.used, 1);
 out:
-	return cur;
+	return ret;
 }
 
 /**
  * avc_lookup - Look up an AVC entry.
  * @ssid: source security identifier
  * @tsid: target security identifier
  * @tclass: target security class
  * @requested: requested permissions, interpreted based on @tclass
- * @aeref:  AVC entry reference
  *
  * Look up an AVC entry that is valid for the
  * @requested permissions between the SID pair
  * (@ssid, @tsid), interpreting the permissions
  * based on @tclass.  If a valid AVC entry exists,
- * then this function updates @aeref to refer to the
- * entry and returns %0. Otherwise, this function
- * returns -%ENOENT.
+ * then this function returns the avc_node.
+ * Otherwise, this function returns NULL.
  */
-int avc_lookup(u32 ssid, u32 tsid, u16 tclass,
-               u32 requested, struct avc_entry_ref *aeref)
+struct avc_node *avc_lookup(u32 ssid, u32 tsid, u16 tclass, u32 requested)
 {
 	struct avc_node *node;
-	int probes, rc = 0;
+	int probes;
 
 	avc_cache_stats_incr(AVC_CAV_LOOKUPS);
 	node = avc_search_node(ssid, tsid, tclass,&probes);
 
 	if (node && ((node->ae.avd.decided & requested) == requested)) {
 		avc_cache_stats_incr(AVC_CAV_HITS);
 		avc_cache_stats_add(AVC_CAV_PROBES,probes);
-		aeref->ae = &node->ae;
 		goto out;
 	}
 
 	avc_cache_stats_incr(AVC_CAV_MISSES);
-	rc = -ENOENT;
 out:
-	return rc;
+	return node;
+}
+
+static int avc_latest_notif_update(int seqno, int is_insert)
+{
+	int ret = 0;
+	static spinlock_t notif_lock = SPIN_LOCK_UNLOCKED;
+
+	spin_lock(&notif_lock);
+	if (seqno < avc_cache.latest_notif) {
+		if (is_insert) {
+			printk(KERN_WARNING "avc:  seqno %d < latest_notif %d\n",
+			       seqno, avc_cache.latest_notif);
+			ret = -EAGAIN;
+		} else {
+			avc_cache.latest_notif = seqno;
+		}
+	}
+	spin_unlock(&notif_lock);
+	return ret;
 }
 
 /**
  * avc_insert - Insert an AVC entry.
  * @ssid: source security identifier
  * @tsid: target security identifier
  * @tclass: target security class
  * @ae: AVC entry
- * @aeref:  AVC entry reference
  *
  * Insert an AVC entry for the SID pair
  * (@ssid, @tsid) and class @tclass.
  * The access vectors and the sequence number are
  * normally provided by the security server in
  * response to a security_compute_av() call.  If the
  * sequence number @ae->avd.seqno is not less than the latest
  * revocation notification, then the function copies
- * the access vectors into a cache entry, updates
- * @aeref to refer to the entry, and returns %0.
- * Otherwise, this function returns -%EAGAIN.
+ * the access vectors into a cache entry and returns
+ * the inserted avc_node. Otherwise, this function returns NULL.
  */
-int avc_insert(u32 ssid, u32 tsid, u16 tclass,
-               struct avc_entry *ae, struct avc_entry_ref *aeref)
+struct avc_node *avc_insert(u32 ssid, u32 tsid, u16 tclass, struct avc_entry *ae)
 {
-	struct avc_node *node;
-	int rc = 0;
+	struct avc_node *pos, *node = NULL;
+	int hvalue;
 
-	if (ae->avd.seqno < avc_cache.latest_notif) {
-		printk(KERN_WARNING "avc:  seqno %d < latest_notif %d\n",
-		       ae->avd.seqno, avc_cache.latest_notif);
-		rc = -EAGAIN;
+	if (avc_latest_notif_update(ae->avd.seqno, 1))
 		goto out;
-	}
 
-	node = avc_claim_node(ssid, tsid, tclass);
-	if (!node) {
-		rc = -ENOMEM;
-		goto out;
-	}
+	node = avc_get_node();
 
-	node->ae.avd.allowed = ae->avd.allowed;
-	node->ae.avd.decided = ae->avd.decided;
-	node->ae.avd.auditallow = ae->avd.auditallow;
-	node->ae.avd.auditdeny = ae->avd.auditdeny;
-	node->ae.avd.seqno = ae->avd.seqno;
-	aeref->ae = &node->ae;
+	if (node) {
+		hvalue = avc_hash(ssid, tsid, tclass);
+
+		node->ae.ssid = ssid;
+		node->ae.tsid = tsid;
+		node->ae.tclass = tclass;
+		atomic_set(&node->ae.used, 1);
+
+		node->ae.avd.allowed = ae->avd.allowed;
+		node->ae.avd.decided = ae->avd.decided;
+		node->ae.avd.auditallow = ae->avd.auditallow;
+		node->ae.avd.auditdeny = ae->avd.auditdeny;
+		node->ae.avd.seqno = ae->avd.seqno;
+
+		spin_lock(&avc_cache.slots_lock[hvalue]);
+		list_for_each_entry(pos, &avc_cache.slots[hvalue], list){
+			if( pos->ae.ssid == ssid &&
+			    pos->ae.tsid == tsid &&
+			    pos->ae.tclass == tclass ){
+				atomic_dec(&avc_cache.active_nodes);
+				kfree(node->container);
+				goto duplicate;
+			}
+		}
+		list_add_rcu(&node->list, &avc_cache.slots[hvalue]);
+duplicate:
+		spin_unlock(&avc_cache.slots_lock[hvalue]);
+	}
 out:
-	return rc;
+	return node;
 }
 
 static inline void avc_print_ipv6_addr(struct audit_buffer *ab,
 				       struct in6_addr *addr, u16 port,
@@ -685,10 +707,32 @@
 {
 	return (x == y || x == SECSID_WILD || y == SECSID_WILD);
 }
 
-static inline void avc_update_node(u32 event, struct avc_node *node, u32 perms)
+static void avc_update_node(u32 event, u32 perms, u32 ssid, u32 tsid, u16 tclass)
 {
+	int hvalue;
+	struct avc_node *pos, *node, *org = NULL;
+
+	/* Lock the target slot */
+	hvalue = avc_hash(ssid, tsid, tclass);
+	spin_lock(&avc_cache.slots_lock[hvalue]);
+
+	list_for_each_entry(pos, &avc_cache.slots[hvalue], list){
+		if ( ssid==pos->ae.ssid &&
+		     tsid==pos->ae.tsid &&
+		     tclass==pos->ae.tclass ){
+			org = pos;
+			break;
+		}
+	}
+
+	if (!org)
+		goto out;
+
+	node = org->reserved;
+	/* Duplicate and Update */
+	memcpy(node, org, sizeof(struct avc_node));
 	switch (event) {
 	case AVC_CALLBACK_GRANT:
 		node->ae.avd.allowed |= perms;
 		break;
@@ -708,40 +752,42 @@
 	case AVC_CALLBACK_AUDITDENY_DISABLE:
 		node->ae.avd.auditdeny &= ~perms;
 		break;
 	}
+	list_replace_rcu(&org->list,&node->list);
+out:
+	spin_unlock(&avc_cache.slots_lock[hvalue]);
 }
 
 static int avc_update_cache(u32 event, u32 ssid, u32 tsid,
                             u16 tclass, u32 perms)
 {
 	struct avc_node *node;
 	int i;
-	unsigned long flags;
 
-	spin_lock_irqsave(&avc_lock,flags);
+	rcu_read_lock();
 
 	if (ssid == SECSID_WILD || tsid == SECSID_WILD) {
 		/* apply to all matching nodes */
 		for (i = 0; i < AVC_CACHE_SLOTS; i++) {
-			for (node = avc_cache.slots[i]; node;
-			     node = node->next) {
+			list_for_each_entry(node, &avc_cache.slots[i], list) {
 				if (avc_sidcmp(ssid, node->ae.ssid) &&
 				    avc_sidcmp(tsid, node->ae.tsid) &&
-				    tclass == node->ae.tclass) {
-					avc_update_node(event,node,perms);
+				    tclass == node->ae.tclass ) {
+					avc_update_node(event, perms, node->ae.ssid
+					                ,node->ae.tsid, node->ae.tclass);
 				}
 			}
 		}
 	} else {
 		/* apply to one node */
 		node = avc_search_node(ssid, tsid, tclass, NULL);
 		if (node) {
-			avc_update_node(event,node,perms);
+			avc_update_node(event, perms, ssid, tsid, tclass);
 		}
 	}
 
-	spin_unlock_irqrestore(&avc_lock,flags);
+	rcu_read_unlock();
 
 	return 0;
 }
 
@@ -751,9 +797,8 @@
 {
 	struct avc_callback_node *c;
 	u32 tretained = 0, cretained = 0;
 	int rc = 0;
-	unsigned long flags;
 
 	/*
 	 * try_revoke only removes permissions from the cache
 	 * state if they are not retained by the object manager.
@@ -786,12 +831,9 @@
 		avc_update_cache(event,ssid,tsid,tclass,perms);
 		*out_retained = tretained;
 	}
 
-	spin_lock_irqsave(&avc_lock,flags);
-	if (seqno > avc_cache.latest_notif)
-		avc_cache.latest_notif = seqno;
-	spin_unlock_irqrestore(&avc_lock,flags);
+	avc_latest_notif_update(seqno, 0);
 
 out:
 	return rc;
 }
@@ -856,34 +898,22 @@
 int avc_ss_reset(u32 seqno)
 {
 	struct avc_callback_node *c;
 	int i, rc = 0;
-	struct avc_node *node, *tmp;
-	unsigned long flags;
+	struct avc_node *node;
 
 	avc_hash_eval("reset");
 
-	spin_lock_irqsave(&avc_lock,flags);
-
+	rcu_read_lock();
 	for (i = 0; i < AVC_CACHE_SLOTS; i++) {
-		node = avc_cache.slots[i];
-		while (node) {
-			tmp = node;
-			node = node->next;
-			tmp->ae.ssid = tmp->ae.tsid = SECSID_NULL;
-			tmp->ae.tclass = SECCLASS_NULL;
-			tmp->ae.avd.allowed = tmp->ae.avd.decided = 0;
-			tmp->ae.avd.auditallow = tmp->ae.avd.auditdeny = 0;
-			tmp->ae.used = 0;
-			tmp->next = avc_node_freelist;
-			avc_node_freelist = tmp;
-			avc_cache.active_nodes--;
+		spin_lock(&avc_cache.slots_lock[i]);
+		list_for_each_entry(node, &avc_cache.slots[i], list){
+			list_del_rcu(&node->list);
+			call_rcu(&node->container->rhead, avc_node_free);
 		}
-		avc_cache.slots[i] = NULL;
+		spin_unlock(&avc_cache.slots_lock[i]);
 	}
-	avc_cache.lru_hint = 0;
-
-	spin_unlock_irqrestore(&avc_lock,flags);
+	rcu_read_unlock();
 
 	for (i = 0; i < AVC_NSTATS; i++)
 		avc_cache_stats[i] = 0;
 
@@ -895,12 +925,9 @@
 				goto out;
 		}
 	}
 
-	spin_lock_irqsave(&avc_lock,flags);
-	if (seqno > avc_cache.latest_notif)
-		avc_cache.latest_notif = seqno;
-	spin_unlock_irqrestore(&avc_lock,flags);
+	avc_latest_notif_update(seqno, 0);
 out:
 	return rc;
 }
 
@@ -949,9 +976,9 @@
  * @ssid: source security identifier
  * @tsid: target security identifier
  * @tclass: target security class
  * @requested: requested permissions, interpreted based on @tclass
- * @aeref:  AVC entry reference
+ * @aeref:  AVC entry reference (not in use)
  * @avd: access vector decisions
  *
  * Check the AVC to determine whether the @requested permissions are granted
  * for the SID pair (@ssid, @tsid), interpreting the permissions
@@ -968,72 +995,46 @@
 int avc_has_perm_noaudit(u32 ssid, u32 tsid,
                          u16 tclass, u32 requested,
                          struct avc_entry_ref *aeref, struct av_decision *avd)
 {
-	struct avc_entry *ae;
-	int rc = 0;
-	unsigned long flags;
+	struct avc_node *node;
 	struct avc_entry entry;
+	int rc = 0;
 	u32 denied;
-	struct avc_entry_ref ref;
-
-	if (!aeref) {
-		avc_entry_ref_init(&ref);
-		aeref = &ref;
-	}
 
-	spin_lock_irqsave(&avc_lock, flags);
+	rcu_read_lock();
 	avc_cache_stats_incr(AVC_ENTRY_LOOKUPS);
-	ae = aeref->ae;
-	if (ae) {
-		if (ae->ssid == ssid &&
-		    ae->tsid == tsid &&
-		    ae->tclass == tclass &&
-		    ((ae->avd.decided & requested) == requested)) {
-			avc_cache_stats_incr(AVC_ENTRY_HITS);
-			ae->used = 1;
-		} else {
-			avc_cache_stats_incr(AVC_ENTRY_DISCARDS);
-			ae = NULL;
-		}
-	}
 
-	if (!ae) {
-		avc_cache_stats_incr(AVC_ENTRY_MISSES);
-		rc = avc_lookup(ssid, tsid, tclass, requested, aeref);
-		if (rc) {
-			spin_unlock_irqrestore(&avc_lock,flags);
-			rc = security_compute_av(ssid,tsid,tclass,requested,&entry.avd);
-			if (rc)
-				goto out;
-			spin_lock_irqsave(&avc_lock, flags);
-			rc = avc_insert(ssid,tsid,tclass,&entry,aeref);
-			if (rc) {
-				spin_unlock_irqrestore(&avc_lock,flags);
-				goto out;
-			}
+	node = avc_lookup(ssid, tsid, tclass, requested);
+	if (!node) {
+		rcu_read_unlock();
+		rc = security_compute_av(ssid,tsid,tclass,requested,&entry.avd);
+		if (rc)
+			goto out;
+		rcu_read_lock();
+		node = avc_insert(ssid,tsid,tclass,&entry);
+		if (!node) {
+			rc = -ENOMEM;
+			rcu_read_unlock();
+			goto out;
 		}
-		ae = aeref->ae;
 	}
-
 	if (avd)
-		memcpy(avd, &ae->avd, sizeof(*avd));
+		memcpy(avd, &node->ae.avd, sizeof(*avd));
 
-	denied = requested & ~(ae->avd.allowed);
+	denied = requested & ~(node->ae.avd.allowed);
 
 	if (!requested || denied) {
 		if (selinux_enforcing) {
-			spin_unlock_irqrestore(&avc_lock,flags);
 			rc = -EACCES;
-			goto out;
 		} else {
-			ae->avd.allowed |= requested;
-			spin_unlock_irqrestore(&avc_lock,flags);
-			goto out;
+			if (node->ae.avd.allowed != (node->ae.avd.allowed|requested))
+				avc_update_node(AVC_CALLBACK_GRANT
+				                ,requested,ssid,tsid,tclass);
 		}
 	}
 
-	spin_unlock_irqrestore(&avc_lock,flags);
+	rcu_read_unlock();
 out:
 	return rc;
 }
 
@@ -1042,9 +1043,9 @@
  * @ssid: source security identifier
  * @tsid: target security identifier
  * @tclass: target security class
  * @requested: requested permissions, interpreted based on @tclass
- * @aeref:  AVC entry reference
+ * @aeref:  AVC entry reference (not in use)
  * @auditdata: auxiliary audit data
  *
  * Check the AVC to determine whether the @requested permissions are granted
  * for the SID pair (@ssid, @tsid), interpreting the permissions
@@ -1061,8 +1062,8 @@
 {
 	struct av_decision avd;
 	int rc;
 
-	rc = avc_has_perm_noaudit(ssid, tsid, tclass, requested, aeref, &avd);
+	rc = avc_has_perm_noaudit(ssid, tsid, tclass, requested, NULL, &avd);
 	avc_audit(ssid, tsid, tclass, requested, &avd, rc, auditdata);
 	return rc;
 }
diff -rNU4 linux-2.6.8.1/security/selinux/include/avc.h linux-2.6.8.1.rcu/security/selinux/include/avc.h
--- linux-2.6.8.1/security/selinux/include/avc.h	2004-08-14 19:54:51.000000000 +0900
+++ linux-2.6.8.1.rcu/security/selinux/include/avc.h	2004-08-20 20:40:50.000000000 +0900
@@ -118,13 +118,11 @@
  */
 
 void __init avc_init(void);
 
-int avc_lookup(u32 ssid, u32 tsid, u16 tclass,
-               u32 requested, struct avc_entry_ref *aeref);
+struct avc_node *avc_lookup(u32 ssid, u32 tsid, u16 tclass, u32 requested);
 
-int avc_insert(u32 ssid, u32 tsid, u16 tclass,
-               struct avc_entry *ae, struct avc_entry_ref *out_aeref);
+struct avc_node *avc_insert(u32 ssid, u32 tsid, u16 tclass, struct avc_entry *ae);
 
 void avc_audit(u32 ssid, u32 tsid,
                u16 tclass, u32 requested,
                struct av_decision *avd, int result, struct avc_audit_data *auditdata);

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-20 14:53     ` James Morris
@ 2004-08-24  7:27       ` Kaigai Kohei
  2004-08-24 13:24         ` James Morris
  0 siblings, 1 reply; 31+ messages in thread
From: Kaigai Kohei @ 2004-08-24  7:27 UTC (permalink / raw)
  To: James Morris; +Cc: Stephen Smalley, SELinux-ML(Eng), Linux Kernel ML(Eng)

Hi James, Thanks for your comments.

> Do you have figures for 1 and 2 CPU?

The results for 1 CPU and 2 CPUs are as follows (2.6.8.1-RCU with the take2 patch):
[write() to files on tmpfs Loop=500000 Parallel=<Num of CPU>]
                 -- 1CPU-- -- 2CPU-- -- 4CPU-- -- 8CPU-- --16CPU-- --32CPU--
2.6.8.1(disable)    8.2959    8.0430    8.0158    8.0183    8.0146    8.0037
2.6.8.1(enable)    11.8427   14.0358   78.0957  319.0451 1322.0313  too long
2.6.8.1.rwlock     11.2485   13.8688   20.0100   49.0390   90.0200  177.0533
2.6.8.1.rcu        11.3435   11.3319   11.0464   11.0205   11.0372   11.0496


> Also, can you run some more benchmarks, e.g. lmbench, unixbench, dbench 
> etc?

The results of UNIXbench and dbench are as follows:

o UNIXbench

* INDEX value comparison
                                       2.6.8.1   2.6.8.1   2.6.8.1   2.6.8.1
                                     (Disable)  (Enable)  (rwlock)     (RCU)
Dhrystone 2 using register variables    268.9     268.8     269.2     269.0
Double-Precision Whetstone               94.2      94.2      94.2      94.2
Execl Throughput                        388.3     379.0     377.8     377.9
File Copy 1024 bufsize 2000 maxblocks   606.6     526.6     515.6     504.8
File Copy 256 bufsize 500 maxblocks     508.9     417.0     410.4     395.2
File Copy 4096 bufsize 8000 maxblocks   987.1     890.4     876.0     857.9
Pipe Throughput                         525.1     406.4     404.5     408.8
Process Creation                        321.2     317.8     315.9     316.3
Shell Scripts (8 concurrent)           1312.8    1276.2    1278.8    1282.8
System Call Overhead                    467.1     468.7     464.1     467.2
                                    ========================================
     FINAL SCORE                        445.8     413.2     410.1     407.7


o dbench [ 4 processes run in parallel on 4 CPUs / 10 trials ]
                  ---- mean ----  - STD -
2.6.8.1(disable)  860.249 [MB/s]   44.683
2.6.8.1(enable)   714.254 [MB/s]   32.359
2.6.8.1(+rwlock)  767.904 [MB/s]   27.968
2.6.8.1(+RCU)     830.678 [MB/s]   16.352


> > > + hvalue = atomic_inc_return(&avc_cache.lru_hint) % AVC_CACHE_SLOTS;
> 
> atomic_inc_return() is not implemented on ia32 or x86-64.  Is there a 
> workaround, or do we need to implement it?  (Andi Kleen suggested using 
> the xadd instruction and altinstructions for i386).

Oops!
Could anybody implement atomic_inc_return() for IA-32 or x86_64?
If not, I'll try to write alternative macros or inline functions.

Thanks.
--------
Kai Gai <kaigai@ak.jp.nec.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-20 20:19     ` Paul E. McKenney
  2004-08-20 20:35       ` James Morris
@ 2004-08-24  7:27       ` Kaigai Kohei
  1 sibling, 0 replies; 31+ messages in thread
From: Kaigai Kohei @ 2004-08-24  7:27 UTC (permalink / raw)
  To: paulmck
  Cc: James Morris, Stephen Smalley, SELinux-ML(Eng), Linux Kernel ML(Eng)

Hi Paul, Thanks for your comments.

> A few comments interspersed for the larger patch (look for empty lines).
> I am missing why some of the list insertions and deletions are SMP safe.

Sorry, some of those were bugs of mine.

> > + if (spin_trylock(&avc_cache.slots_lock[hvalue])) {
> > + list_for_each_rcu(pos, &avc_cache.slots[hvalue]) {
> 
> Since we are holding the lock, the list cannot change, and the _rcu()
> is not needed.  Right?

No; only the insert/update/delete paths must hold the lock.
The _rcu suffix is necessary because the read-only path does not hold the lock.
(It was my bug that the inserting path did not hold the lock.)
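
Schematically, the two paths look like this (a sketch, not code lifted
from the patch, and using the irq-safe locking that comes up later in
the thread):

	/* read-only path: no lock held, so the _rcu traversal is required */
	rcu_read_lock();
	list_for_each_entry_rcu(node, &avc_cache.slots[hvalue], list) {
		/* ... compare ssid/tsid/tclass ... */
	}
	rcu_read_unlock();

	/* update path: the per-bucket lock serializes all writers */
	spin_lock_irqsave(&avc_cache.slots_lock[hvalue], flags);
	list_add_rcu(&new->list, &avc_cache.slots[hvalue]);
	spin_unlock_irqrestore(&avc_cache.slots_lock[hvalue], flags);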

> > + node = list_entry(pos, struct avc_node, list);
> 
> Why not just do:
> 
> list_for_each_entry(pos, &avc_cache.slots[hvalue], list) {
> 
> rather than having the separate list_entry().

Your suggestion is better. I fixed it.

> > + if (org) {
> > + if (!new) {
> 
> If we run out of memory, we silently delete the node that we were
> wanting to update?  Is this really what you want?
> 
> OK, OK, given that the "C" in "AVC" stands for "cache", I guess that
> deleting the element does indeed make sense...

Indeed, I thought it was pretty rough, and fixed avc_insert().
Now, kmalloc() is unnecessary when updating, so the update always
succeeds. (But considerably more memory is required.)

> > + switch (event) {
> > + case AVC_CALLBACK_GRANT:
> > + new->ae.avd.allowed |= perms;
> 
> This is a 32-bit field, and is protected by a lock.  Therefore, only
> one update should be happening at a time.  This field is 32-bit aligned
> (right?), and so the write that does the update will appear atomic to
> readers.  Only one of the the updates happens in a given call.

Is this about "new->ae.avd.allowed |= perms;"?
No one else can refer to 'new', since 'new' is the duplicated entry
of 'org'. It is safe whether the lock is held or not.

In addition, the problem which had its roots in avc_insert()
was solved by the take2 patch.

> o There is some bizarre CPU out there that does not do
> atomic 32-bit writes (but this would break all sorts of
> stuff).
> 
> o Consecutive changes might give a slow reader the incorrect
> impression that permissions are wide open.  One way to take
> care of this would be to force a grace period between each
> change.  One way to do this would be to set a bit indicating
> an update, and have the call_rcu() clear it.  When getting
> ready to update, if the bit is still set, block until the
> bit is cleared.  This would guarantee that a given reader
> would see at most one update.
> 
> But this would require considerable code restructuring, since
> this code is still under an rcu_read_lock().  But putting
> forward the thought anyway in case it is helpful.
> 
> If updates are -really- rare, a simple way to make this happen
> would be to protect the updates with a sema_t, and just do
> an unconditional synchronize_kernel() just before releasing
> the sema_t.
> 
> Since the current code permits readers to see changes to
> several different nodes, a single synchronize_kernel() should
> suffice.
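
The simpler variant Paul describes last, sketched (hypothetical code;
synchronize_kernel() was the 2.6.8-era name for what later became
synchronize_rcu(), and this cannot be used where updates may run in
interrupt context):

	static DECLARE_MUTEX(avc_update_sem);	/* a sema_t used as a mutex */

	down(&avc_update_sem);
	/* ... copy the node, modify the copy, list_replace_rcu(),
	 *     call_rcu() to free the old copy ... */
	synchronize_kernel();	/* readers see at most one update at a time */
	up(&avc_update_sem);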

> > + list_del_rcu(&node->list);
> 
> I don't see what prevents multiple list_del_rcu()s from executing in
> parallel, mangling the list.  Is there some higher-level lock?
> If so, why do we need RCU protecting the list?

I agree, and fixed it.

> > + rcu_read_lock();
> > + node = avc_insert(ssid,tsid,tclass,&entry);
> 
> I don't see what prevents two copies of avc_insert from running in parallel.
> Seems like this would trash the lists.  Or, if there is a higher-level lock
> that I am missing, why do we need RCU protecting the lists?

I agree, and fixed it.

Thanks.
--------
Kai Gai <kaigai@ak.jp.nec.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-24  7:27       ` Kaigai Kohei
@ 2004-08-24 13:24         ` James Morris
  2004-08-25  9:51           ` Kaigai Kohei
  2004-08-25  9:52           ` [PATCH]atomic_inc_return() for i386/x86_64 (Re: RCU issue with SELinux) Kaigai Kohei
  0 siblings, 2 replies; 31+ messages in thread
From: James Morris @ 2004-08-24 13:24 UTC (permalink / raw)
  To: Kaigai Kohei; +Cc: Stephen Smalley, SELinux-ML(Eng), Linux Kernel ML(Eng)

On Tue, 24 Aug 2004, Kaigai Kohei wrote:

> o UNIXbench
> 
> * INDEX value comparison
>                                        2.6.8.1   2.6.8.1   2.6.8.1   2.6.8.1
>                                      (Disable)  (Enable)  (rwlock)     (RCU)
> Dhrystone 2 using register variables    268.9     268.8     269.2     269.0
> Double-Precision Whetstone               94.2      94.2      94.2      94.2
> Execl Throughput                        388.3     379.0     377.8     377.9 +
> File Copy 1024 bufsize 2000 maxblocks   606.6     526.6     515.6     504.8 *
> File Copy 256 bufsize 500 maxblocks     508.9     417.0     410.4     395.2 *
> File Copy 4096 bufsize 8000 maxblocks   987.1     890.4     876.0     857.9 *
> Pipe Throughput                         525.1     406.4     404.5     408.8 +
> Process Creation                        321.2     317.8     315.9     316.3 +
> Shell Scripts (8 concurrent)           1312.8    1276.2    1278.8    1282.8 +
> System Call Overhead                    467.1     468.7     464.1     467.2 +
>                                     ========================================
>      FINAL SCORE                        445.8     413.2     410.1     407.7

This benchmark somewhat characterizes 1P performance, and the ones I've 
marked with (*) get noticeably worse with the RCU patch compared to the 
current locking scheme.  Tests marked (+) show no or insignificant 
improvement.

Might be useful to compare with the lmbench macrobenchmark, to see if it 
shows a similar pattern.

> o dbench [ 4 processes run in parallel on 4 CPUs / 10 trials ]
>                   ---- mean ----  - STD -
> 2.6.8.1(disable)  860.249 [MB/s]   44.683
> 2.6.8.1(enable)   714.254 [MB/s]   32.359
> 2.6.8.1(+rwlock)  767.904 [MB/s]   27.968
> 2.6.8.1(+RCU)     830.678 [MB/s]   16.352

Can you show the figures for 1 and 2 clients?

> Could anybody implement atomic_inc_return() for IA-32 or x86_64?
> If not, I'll try to write alternative macros or inline functions.

If you can get this done, it will be very useful, as I could also run 
some benchmarks on my test systems.


- James
-- 
James Morris
<jmorris@redhat.com>






^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-24  7:25       ` Kaigai Kohei
@ 2004-08-24 15:37         ` Stephen Smalley
  2004-08-25  9:51           ` Kaigai Kohei
  2004-08-24 23:02         ` RCU issue with SELinux (Re: SELINUX performance issues) Paul E. McKenney
  1 sibling, 1 reply; 31+ messages in thread
From: Stephen Smalley @ 2004-08-24 15:37 UTC (permalink / raw)
  To: Kaigai Kohei; +Cc: SELinux-ML(Eng), Linux Kernel ML(Eng), James Morris

On Tue, 2004-08-24 at 03:25, Kaigai Kohei wrote:
> You are right. Indeed, the lock for the hash bucket is also necessary
> when avc_insert() is called. I have fixed this.

avc_has_perm* can be called from interrupt or bh, e.g. send_sigio or
sock_rcv_skb.  So using just spin_lock/spin_unlock rather than
spin_lock_irqsave/restore is unsafe, right?
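
A minimal sketch of the irq-safe form being asked for (the function
name here is invented for illustration):

	static void avc_bucket_update(int hvalue)
	{
		unsigned long flags;

		/* This lock can also be taken from interrupt context
		 * (e.g. send_sigio -> avc_has_perm), so a plain
		 * spin_lock() would deadlock if the interrupt arrived
		 * while the lock was held on the same CPU.  Disabling
		 * local interrupts prevents that: */
		spin_lock_irqsave(&avc_cache.slots_lock[hvalue], flags);
		/* ... insert/update/delete under the lock ... */
		spin_unlock_irqrestore(&avc_cache.slots_lock[hvalue], flags);
	}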
 
-- 
Stephen Smalley <sds@epoch.ncsc.mil>
National Security Agency


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-24  7:25       ` Kaigai Kohei
  2004-08-24 15:37         ` Stephen Smalley
@ 2004-08-24 23:02         ` Paul E. McKenney
  2004-08-25  9:51           ` Kaigai Kohei
  1 sibling, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2004-08-24 23:02 UTC (permalink / raw)
  To: Kaigai Kohei
  Cc: Stephen Smalley, SELinux-ML(Eng), Linux Kernel ML(Eng), James Morris

On Tue, Aug 24, 2004 at 04:25:32PM +0900, Kaigai Kohei wrote:
> Hi Stephen, Thanks for your comments.
> 
> > I'm not overly familiar with RCU myself, but the comments in list.h for
> > list_add_rcu suggest that you still need to hold a lock to avoid racing
> > with another list_add_rcu or list_del_rcu call on the same list.  But
> > avc_insert is calling list_add_rcu without holding any lock; can't it
> > race with another avc_insert on the same hash bucket?  Do I just
> > misunderstand, or is this unsafe?  Thanks for clarifying.
> 
> You are right. Indeed, the lock for the hash bucket is also necessary
> when avc_insert() is called. I have fixed this.
> 
> > I think we can likely eliminate the mutation of the node in the
> > !selinux_enforcing case in avc_has_perm_noaudit, i.e. eliminate the
> > entire else clause and just fall through with rc still 0.  Adding the
> > requested permissions to the node was simply to avoid flooding denials
> > in permissive mode on the same permission check, but this can be
> > addressed separately using the audit ratelimit mechanism.
> 
> I have a different opinion.
> This simple mechanism against the flood of audit logs is necessary,
> because it prevents the system log buffer from being exhausted and the
> console from being covered with denial messages while we are debugging
> the security policy in permissive mode.
> So I improved the avc_update_node() function and the avc_node data structure;
> avc_update_node() no longer needs kmalloc().
> 
> This approach is good for the durability and compatibility of the original
> implementation, but updating without kmalloc() requires double the memory
> for avc_nodes. The approach can be applied to any kind of avc_entry update,
> though the idea is rather more complex.
> 
> I modified the following points:
> - We hold the lock for the hash bucket when avc_insert() and avc_ss_reset()
>   are called, for safety.
> - list_for_each_rcu() and list_entry() are replaced by list_for_each_entry().

One subtlety here...

The traversals that are protected by rcu_read_lock() (rather than an
update-side spinlock) need to be list_for_each_entry_rcu() rather than
list_for_each_entry().  The "_rcu()" is required in order to work
reliably on Alpha, and has the added benefit of calling out exactly
which traversals are RCU-protected.

Update-side code remains list_for_each_entry().

> - An avc_node_dual structure which contains two avc_node objects is defined.
>   It allows avc_update_node() to work without kmalloc() or any additional locks.

What happens when you have two consecutive updates to the same object?
Don't you have to defer the second update until a grace period has
elapsed since the first update in order to avoid confusing readers that
are still accessing the original version?

One way to do this would be to set a "don't-touch-me" bit that is
cleared by an RCU callback.  An update to an element with the
"don't-touch-me" bit set would block until the bit clears.  There
are probably better ways...
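
A rough sketch of that idea (AVC_UPD_PENDING, ->flags, and ->successor
are invented here for illustration and do not exist in the patches;
blocking in the spin-wait below would need the restructuring mentioned
earlier, since the updater may hold the bucket lock with irqs off):

	static void avc_node_settle(struct rcu_head *rhead)
	{
		struct avc_node *old = container_of(rhead, struct avc_node, rhead);

		/* a grace period has elapsed; the replacement node
		 * may now be updated again */
		clear_bit(AVC_UPD_PENDING, &old->successor->flags);
		kfree(old);
	}

	/* updater, with the bucket lock held */
	while (test_bit(AVC_UPD_PENDING, &org->flags))
		cpu_relax();	/* previous update still within its grace period */
	set_bit(AVC_UPD_PENDING, &new->flags);
	org->successor = new;
	list_replace_rcu(&org->list, &new->list);
	call_rcu(&org->rhead, avc_node_settle);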

							Thanx, Paul

> Any comments are welcome. Thanks.
> --------
> Kai Gai <kaigai@ak.jp.nec.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-24 15:37         ` Stephen Smalley
@ 2004-08-25  9:51           ` Kaigai Kohei
  2004-08-25 15:50             ` Stephen Smalley
  0 siblings, 1 reply; 31+ messages in thread
From: Kaigai Kohei @ 2004-08-25  9:51 UTC (permalink / raw)
  To: Stephen Smalley; +Cc: SELinux-ML(Eng), Linux Kernel ML(Eng), James Morris

[-- Attachment #1: Type: text/plain, Size: 1047 bytes --]

Hi Stephen, thanks for your comment.

> > You are right. Indeed, the lock for hash bucket is also necessary
> > when avc_insert() is called. I fixed them.
> 
> avc_has_perm* can be called from interrupt or bh, e.g. send_sigio or
> sock_rcv_skb.  So using just spin_lock/spin_unlock rather than
> spin_lock_irqsave/restore is unsafe, right?

Indeed, spin_lock/spin_unlock should be replaced by spin_lock_irqsave/restore.
I fixed it.

The attached take3 patch is modified as follows:
- avc_node_dual was eliminated per Paul E. McKenney's suggestion.
  avc_update_node() calls kmalloc() and may return -ENOMEM.
  (But I think this effect is quite limited.)
- All list_for_each_entry() calls were replaced by list_for_each_entry_rcu().
- All spin_lock()/spin_unlock() calls were replaced by spin_lock_irqsave()
  /spin_unlock_irqrestore().
- In avc_node_insert(), if an entry with the same ssid/tsid/tclass as the new
  one exists, the older entry is replaced by the new one.

Thanks. I hope this can be the final version. :)

--------
Kai Gai <kaigai@ak.jp.nec.com>

[-- Attachment #2: list_replace_rcu-2.6.8.1.patch --]
[-- Type: application/octet-stream, Size: 845 bytes --]

--- linux-2.6.8.1/include/linux/list.h	2004-08-14 19:55:33.000000000 +0900
+++ linux-2.6.8.1.rcu/include/linux/list.h	2004-08-20 18:04:10.000000000 +0900
@@ -194,8 +194,23 @@
 	__list_del(entry->prev, entry->next);
 	entry->prev = LIST_POISON2;
 }
 
+/*
+ * list_replace_rcu - replace old entry with new one in list
+ * @old : the element to be replaced from the list.
+ * @new : the new element to insert into the list.
+ *
+ * The old entry will be replaced by the new entry atomically.
+ */
+static inline void list_replace_rcu(struct list_head *old, struct list_head *new){
+	new->next = old->next;
+	new->prev = old->prev;
+	smp_wmb();
+	new->next->prev = new;
+	new->prev->next = new;
+}
+
 /**
  * list_del_init - deletes entry from list and reinitialize it.
  * @entry: the element to delete from the list.
  */

[-- Attachment #3: selinux.rcu-2.6.8.1-take3.patch --]
[-- Type: application/octet-stream, Size: 21275 bytes --]

diff -rNU4 linux-2.6.8.1/security/selinux/avc.c linux-2.6.8.1.rcu/security/selinux/avc.c
--- linux-2.6.8.1/security/selinux/avc.c	2004-08-14 19:55:48.000000000 +0900
+++ linux-2.6.8.1.rcu/security/selinux/avc.c	2004-08-25 15:37:20.000000000 +0900
@@ -35,28 +35,32 @@
 #include "av_perm_to_string.h"
 #include "objsec.h"
 
 #define AVC_CACHE_SLOTS		512
-#define AVC_CACHE_MAXNODES	410
+#define AVC_CACHE_THRESHOLD	410
+#define AVC_CACHE_RECLAIM	16
 
 struct avc_entry {
 	u32			ssid;
 	u32			tsid;
 	u16			tclass;
 	struct av_decision	avd;
-	int			used;	/* used recently */
+	atomic_t		used;	/* used recently */
 };
 
 struct avc_node {
 	struct avc_entry	ae;
-	struct avc_node		*next;
+	struct list_head	list;
+	struct rcu_head         rhead;
 };
 
 struct avc_cache {
-	struct avc_node	*slots[AVC_CACHE_SLOTS];
-	u32		lru_hint;	/* LRU hint for reclaim scan */
-	u32		active_nodes;
-	u32		latest_notif;	/* latest revocation notification */
+	struct list_head	slots[AVC_CACHE_SLOTS];
+	spinlock_t		slots_lock[AVC_CACHE_SLOTS];
+		/* lock for insert/update/delete and reset */
+	atomic_t		lru_hint;	/* LRU hint for reclaim scan */
+	atomic_t		active_nodes;
+	u32			latest_notif;	/* latest revocation notification */
 };
 
 struct avc_callback_node {
 	int (*callback) (u32 event, u32 ssid, u32 tsid,
@@ -69,10 +73,8 @@
 	u32 perms;
 	struct avc_callback_node *next;
 };
 
-static spinlock_t avc_lock = SPIN_LOCK_UNLOCKED;
-static struct avc_node *avc_node_freelist;
 static struct avc_cache avc_cache;
 static unsigned avc_cache_stats[AVC_NSTATS];
 static struct avc_callback_node *avc_callbacks;
 
@@ -187,23 +189,17 @@
  * Initialize the access vector cache.
  */
 void __init avc_init(void)
 {
-	struct avc_node	*new;
 	int i;
 
-	for (i = 0; i < AVC_CACHE_MAXNODES; i++) {
-		new = kmalloc(sizeof(*new), GFP_ATOMIC);
-		if (!new) {
-			printk(KERN_WARNING "avc:  only able to allocate "
-			       "%d entries\n", i);
-			break;
-		}
-		memset(new, 0, sizeof(*new));
-		new->next = avc_node_freelist;
-		avc_node_freelist = new;
-	}
-
+	for (i =0; i < AVC_CACHE_SLOTS; i++) {
+		INIT_LIST_HEAD(&avc_cache.slots[i]);
+		avc_cache.slots_lock[i] = SPIN_LOCK_UNLOCKED;
+	}
+	atomic_set(&avc_cache.active_nodes, 0);
+	atomic_set(&avc_cache.lru_hint, 0);
+	
 	audit_log(current->audit_context, "AVC INITIALIZED\n");
 }
 
 #if 0
@@ -212,27 +208,24 @@
 	int i, chain_len, max_chain_len, slots_used;
 	struct avc_node *node;
 	unsigned long flags;
 
-	spin_lock_irqsave(&avc_lock,flags);
+	rcu_read_lock();
 
 	slots_used = 0;
 	max_chain_len = 0;
 	for (i = 0; i < AVC_CACHE_SLOTS; i++) {
-		node = avc_cache.slots[i];
-		if (node) {
+		if (!list_empty(&avc_cache.slots[i])) {
 			slots_used++;
 			chain_len = 0;
-			while (node) {
+			list_for_each_entry_rcu(node, &avc_cache.slots[i], list)
 				chain_len++;
-				node = node->next;
-			}
 			if (chain_len > max_chain_len)
 				max_chain_len = chain_len;
 		}
 	}
 
-	spin_unlock_irqrestore(&avc_lock,flags);
+	rcu_read_unlock();
 
 	printk(KERN_INFO "\n");
 	printk(KERN_INFO "%s avc:  %d entries and %d/%d buckets used, longest "
 	       "chain length %d\n", tag, avc_cache.active_nodes, slots_used,
@@ -242,188 +235,207 @@
 static inline void avc_hash_eval(char *tag)
 { }
 #endif
 
-static inline struct avc_node *avc_reclaim_node(void)
+static void avc_node_free(struct rcu_head *rhead) {
+	struct avc_node *node;
+	node = container_of(rhead, struct avc_node, rhead);
+	kfree(node);
+}
+
+static inline int avc_reclaim_node(void)
 {
-	struct avc_node *prev, *cur;
-	int hvalue, try;
+	struct avc_node *node;
+	int hvalue, try, ecx;
 
-	hvalue = avc_cache.lru_hint;
-	for (try = 0; try < 2; try++) {
-		do {
-			prev = NULL;
-			cur = avc_cache.slots[hvalue];
-			while (cur) {
-				if (!cur->ae.used)
-					goto found;
+	for (try = 0, ecx = 0; try < AVC_CACHE_SLOTS; try++ ) {
+		hvalue = atomic_inc_return(&avc_cache.lru_hint) % AVC_CACHE_SLOTS;
 
-				cur->ae.used = 0;
+		if (!spin_trylock(&avc_cache.slots_lock[hvalue]))
+			continue;
 
-				prev = cur;
-				cur = cur->next;
+		list_for_each_entry_rcu(node, &avc_cache.slots[hvalue], list) {
+			if (!atomic_dec_and_test(&node->ae.used)) {
+				/* Recently Unused */
+				list_del_rcu(&node->list);
+				call_rcu(&node->rhead, avc_node_free);
+				atomic_dec(&avc_cache.active_nodes);
+				ecx++;
+				if (ecx >= AVC_CACHE_RECLAIM) {
+					spin_unlock(&avc_cache.slots_lock[hvalue]);
+					goto out;
+				}
 			}
-			hvalue = (hvalue + 1) & (AVC_CACHE_SLOTS - 1);
-		} while (hvalue != avc_cache.lru_hint);
+		}
+		spin_unlock(&avc_cache.slots_lock[hvalue]);
 	}
-
-	panic("avc_reclaim_node");
-
-found:
-	avc_cache.lru_hint = hvalue;
-
-	if (prev == NULL)
-		avc_cache.slots[hvalue] = cur->next;
-	else
-		prev->next = cur->next;
-
-	return cur;
+out:
+	return ecx;
 }
 
-static inline struct avc_node *avc_claim_node(u32 ssid,
-                                              u32 tsid, u16 tclass)
+static inline struct avc_node *avc_get_node(void)
 {
 	struct avc_node *new;
-	int hvalue;
+	int actives;
 
-	hvalue = avc_hash(ssid, tsid, tclass);
-	if (avc_node_freelist) {
-		new = avc_node_freelist;
-		avc_node_freelist = avc_node_freelist->next;
-		avc_cache.active_nodes++;
-	} else {
-		new = avc_reclaim_node();
-		if (!new)
-			goto out;
-	}
-
-	new->ae.used = 1;
-	new->ae.ssid = ssid;
-	new->ae.tsid = tsid;
-	new->ae.tclass = tclass;
-	new->next = avc_cache.slots[hvalue];
-	avc_cache.slots[hvalue] = new;
+	new = kmalloc(sizeof(struct avc_node), GFP_ATOMIC);
+	if (!new)
+		return NULL;
+
+	INIT_RCU_HEAD(&new->rhead);
+	INIT_LIST_HEAD(&new->list);
+
+	actives = atomic_inc_return(&avc_cache.active_nodes);
+	if (actives > AVC_CACHE_THRESHOLD)
+		avc_reclaim_node();
 
-out:
 	return new;
 }
 
 static inline struct avc_node *avc_search_node(u32 ssid, u32 tsid,
                                                u16 tclass, int *probes)
 {
-	struct avc_node *cur;
+	struct avc_node *node, *ret = NULL;
 	int hvalue;
 	int tprobes = 1;
 
 	hvalue = avc_hash(ssid, tsid, tclass);
-	cur = avc_cache.slots[hvalue];
-	while (cur != NULL &&
-	       (ssid != cur->ae.ssid ||
-		tclass != cur->ae.tclass ||
-		tsid != cur->ae.tsid)) {
+	list_for_each_entry_rcu(node, &avc_cache.slots[hvalue], list) {
+		if (ssid == node->ae.ssid &&
+		    tclass == node->ae.tclass &&
+		    tsid == node->ae.tsid) {
+			ret = node;
+			break;
+		}
 		tprobes++;
-		cur = cur->next;
 	}
 
-	if (cur == NULL) {
+	if (ret == NULL) {
 		/* cache miss */
 		goto out;
 	}
 
 	/* cache hit */
 	if (probes)
 		*probes = tprobes;
-
-	cur->ae.used = 1;
-
+	if (atomic_read(&ret->ae.used) != 1)
+		atomic_set(&ret->ae.used, 1);
 out:
-	return cur;
+	return ret;
 }
 
 /**
  * avc_lookup - Look up an AVC entry.
  * @ssid: source security identifier
  * @tsid: target security identifier
  * @tclass: target security class
  * @requested: requested permissions, interpreted based on @tclass
- * @aeref:  AVC entry reference
  *
  * Look up an AVC entry that is valid for the
  * @requested permissions between the SID pair
  * (@ssid, @tsid), interpreting the permissions
  * based on @tclass.  If a valid AVC entry exists,
- * then this function updates @aeref to refer to the
- * entry and returns %0. Otherwise, this function
- * returns -%ENOENT.
+ * then this function returns the avc_node.
+ * Otherwise, this function returns NULL.
  */
-int avc_lookup(u32 ssid, u32 tsid, u16 tclass,
-               u32 requested, struct avc_entry_ref *aeref)
+struct avc_node *avc_lookup(u32 ssid, u32 tsid, u16 tclass, u32 requested)
 {
 	struct avc_node *node;
-	int probes, rc = 0;
+	int probes;
 
 	avc_cache_stats_incr(AVC_CAV_LOOKUPS);
 	node = avc_search_node(ssid, tsid, tclass,&probes);
 
 	if (node && ((node->ae.avd.decided & requested) == requested)) {
 		avc_cache_stats_incr(AVC_CAV_HITS);
 		avc_cache_stats_add(AVC_CAV_PROBES,probes);
-		aeref->ae = &node->ae;
 		goto out;
 	}
 
 	avc_cache_stats_incr(AVC_CAV_MISSES);
-	rc = -ENOENT;
 out:
-	return rc;
+	return node;
+}
+
+static int avc_latest_notif_update(int seqno, int is_insert)
+{
+	int ret = 0;
+	static spinlock_t notif_lock = SPIN_LOCK_UNLOCKED;
+	unsigned long flag;
+
+	spin_lock_irqsave(&notif_lock, flag);
+	if (seqno < avc_cache.latest_notif) {
+		if (is_insert) {
+			printk(KERN_WARNING "avc:  seqno %d < latest_notif %d\n",
+			       seqno, avc_cache.latest_notif);
+			ret = -EAGAIN;
+		} else {
+			avc_cache.latest_notif = seqno;
+		}
+	}
+	spin_unlock_irqrestore(&notif_lock, flag);
+	return ret;
 }
 
 /**
  * avc_insert - Insert an AVC entry.
  * @ssid: source security identifier
  * @tsid: target security identifier
  * @tclass: target security class
  * @ae: AVC entry
- * @aeref:  AVC entry reference
  *
  * Insert an AVC entry for the SID pair
  * (@ssid, @tsid) and class @tclass.
  * The access vectors and the sequence number are
  * normally provided by the security server in
  * response to a security_compute_av() call.  If the
  * sequence number @ae->avd.seqno is not less than the latest
  * revocation notification, then the function copies
- * the access vectors into a cache entry, updates
- * @aeref to refer to the entry, and returns %0.
- * Otherwise, this function returns -%EAGAIN.
+ * the access vectors into a cache entry and returns
+ * the inserted avc_node. Otherwise, this function returns NULL.
  */
-int avc_insert(u32 ssid, u32 tsid, u16 tclass,
-               struct avc_entry *ae, struct avc_entry_ref *aeref)
+struct avc_node *avc_insert(u32 ssid, u32 tsid, u16 tclass, struct avc_entry *ae)
 {
-	struct avc_node *node;
-	int rc = 0;
+	struct avc_node *pos, *node = NULL;
+	int hvalue;
+	unsigned long flag;
 
-	if (ae->avd.seqno < avc_cache.latest_notif) {
-		printk(KERN_WARNING "avc:  seqno %d < latest_notif %d\n",
-		       ae->avd.seqno, avc_cache.latest_notif);
-		rc = -EAGAIN;
+	if (avc_latest_notif_update(ae->avd.seqno, 1))
 		goto out;
-	}
 
-	node = avc_claim_node(ssid, tsid, tclass);
-	if (!node) {
-		rc = -ENOMEM;
-		goto out;
-	}
+	node = avc_get_node();
+
+	if (node) {
+		hvalue = avc_hash(ssid, tsid, tclass);
 
-	node->ae.avd.allowed = ae->avd.allowed;
-	node->ae.avd.decided = ae->avd.decided;
-	node->ae.avd.auditallow = ae->avd.auditallow;
-	node->ae.avd.auditdeny = ae->avd.auditdeny;
-	node->ae.avd.seqno = ae->avd.seqno;
-	aeref->ae = &node->ae;
+		node->ae.ssid = ssid;
+		node->ae.tsid = tsid;
+		node->ae.tclass = tclass;
+		atomic_set(&node->ae.used, 1);
+
+		node->ae.avd.allowed = ae->avd.allowed;
+		node->ae.avd.decided = ae->avd.decided;
+		node->ae.avd.auditallow = ae->avd.auditallow;
+		node->ae.avd.auditdeny = ae->avd.auditdeny;
+		node->ae.avd.seqno = ae->avd.seqno;
+
+		spin_lock_irqsave(&avc_cache.slots_lock[hvalue], flag);
+		list_for_each_entry_rcu(pos, &avc_cache.slots[hvalue], list) {
+			if (pos->ae.ssid == ssid &&
+			    pos->ae.tsid == tsid &&
+			    pos->ae.tclass == tclass) {
+				list_replace_rcu(&pos->list, &node->list);
+				call_rcu(&pos->rhead, avc_node_free);
+				atomic_dec(&avc_cache.active_nodes);
+				goto found;
+			}
+		}
+		list_add_rcu(&node->list, &avc_cache.slots[hvalue]);
+found:
+		spin_unlock_irqrestore(&avc_cache.slots_lock[hvalue], flag);
+	}
 out:
-	return rc;
+	return node;
 }
 
 static inline void avc_print_ipv6_addr(struct audit_buffer *ab,
 				       struct in6_addr *addr, u16 port,
@@ -685,10 +697,51 @@
 {
 	return (x == y || x == SECSID_WILD || y == SECSID_WILD);
 }
 
-static inline void avc_update_node(u32 event, struct avc_node *node, u32 perms)
+/**
+ * avc_update_node - Update an AVC entry
+ * @event: updating event
+ * @perms: permission mask bits
+ * @ssid,@tsid,@tclass: identifier of an AVC entry
+ *
+ * If a valid AVC entry doesn't exist, this function returns -ENOENT.
+ * If the kmalloc() called internally returns NULL, this function returns -ENOMEM.
+ * Otherwise, this function updates the AVC entry. The original AVC entry
+ * will be released later by RCU.
+ */
+static int avc_update_node(u32 event, u32 perms, u32 ssid, u32 tsid, u16 tclass)
 {
+	int hvalue, rc = 0;
+	unsigned long flag;
+	struct avc_node *pos, *node, *org = NULL;
+
+	/* Lock the target slot */
+	hvalue = avc_hash(ssid, tsid, tclass);
+	spin_lock_irqsave(&avc_cache.slots_lock[hvalue], flag);
+
+	list_for_each_entry_rcu(pos, &avc_cache.slots[hvalue], list){
+		if ( ssid==pos->ae.ssid &&
+		     tsid==pos->ae.tsid &&
+		     tclass==pos->ae.tclass ){
+			org = pos;
+			break;
+		}
+	}
+
+	if (!org) {
+		rc = -ENOENT;
+		goto out;
+	}
+
+	node = kmalloc(sizeof(struct avc_node), GFP_ATOMIC);
+	if (!node) {
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	/* Duplicate and Update */
+	memcpy(node, org, sizeof(struct avc_node));
 	switch (event) {
 	case AVC_CALLBACK_GRANT:
 		node->ae.avd.allowed |= perms;
 		break;
@@ -708,40 +761,41 @@
 	case AVC_CALLBACK_AUDITDENY_DISABLE:
 		node->ae.avd.auditdeny &= ~perms;
 		break;
 	}
+	list_replace_rcu(&org->list,&node->list);
+out:
+	spin_unlock_irqrestore(&avc_cache.slots_lock[hvalue], flag);
+
+	return rc;
 }
 
 static int avc_update_cache(u32 event, u32 ssid, u32 tsid,
                             u16 tclass, u32 perms)
 {
 	struct avc_node *node;
 	int i;
-	unsigned long flags;
 
-	spin_lock_irqsave(&avc_lock,flags);
+	rcu_read_lock();
 
 	if (ssid == SECSID_WILD || tsid == SECSID_WILD) {
 		/* apply to all matching nodes */
 		for (i = 0; i < AVC_CACHE_SLOTS; i++) {
-			for (node = avc_cache.slots[i]; node;
-			     node = node->next) {
+			list_for_each_entry_rcu(node, &avc_cache.slots[i], list) {
 				if (avc_sidcmp(ssid, node->ae.ssid) &&
 				    avc_sidcmp(tsid, node->ae.tsid) &&
-				    tclass == node->ae.tclass) {
-					avc_update_node(event,node,perms);
+				    tclass == node->ae.tclass ) {
+					avc_update_node(event, perms, node->ae.ssid
+					                ,node->ae.tsid, node->ae.tclass);
 				}
 			}
 		}
 	} else {
 		/* apply to one node */
-		node = avc_search_node(ssid, tsid, tclass, NULL);
-		if (node) {
-			avc_update_node(event,node,perms);
-		}
+		avc_update_node(event, perms, ssid, tsid, tclass);
 	}
 
-	spin_unlock_irqrestore(&avc_lock,flags);
+	rcu_read_unlock();
 
 	return 0;
 }
 
@@ -751,9 +805,8 @@
 {
 	struct avc_callback_node *c;
 	u32 tretained = 0, cretained = 0;
 	int rc = 0;
-	unsigned long flags;
 
 	/*
 	 * try_revoke only removes permissions from the cache
 	 * state if they are not retained by the object manager.
@@ -786,12 +839,9 @@
 		avc_update_cache(event,ssid,tsid,tclass,perms);
 		*out_retained = tretained;
 	}
 
-	spin_lock_irqsave(&avc_lock,flags);
-	if (seqno > avc_cache.latest_notif)
-		avc_cache.latest_notif = seqno;
-	spin_unlock_irqrestore(&avc_lock,flags);
+	avc_latest_notif_update(seqno, 0);
 
 out:
 	return rc;
 }
@@ -856,34 +906,23 @@
 int avc_ss_reset(u32 seqno)
 {
 	struct avc_callback_node *c;
 	int i, rc = 0;
-	struct avc_node *node, *tmp;
-	unsigned long flags;
+	unsigned long flag;
+	struct avc_node *node;
 
 	avc_hash_eval("reset");
 
-	spin_lock_irqsave(&avc_lock,flags);
-
+	rcu_read_lock();
 	for (i = 0; i < AVC_CACHE_SLOTS; i++) {
-		node = avc_cache.slots[i];
-		while (node) {
-			tmp = node;
-			node = node->next;
-			tmp->ae.ssid = tmp->ae.tsid = SECSID_NULL;
-			tmp->ae.tclass = SECCLASS_NULL;
-			tmp->ae.avd.allowed = tmp->ae.avd.decided = 0;
-			tmp->ae.avd.auditallow = tmp->ae.avd.auditdeny = 0;
-			tmp->ae.used = 0;
-			tmp->next = avc_node_freelist;
-			avc_node_freelist = tmp;
-			avc_cache.active_nodes--;
+		spin_lock_irqsave(&avc_cache.slots_lock[i], flag);
+		list_for_each_entry_rcu(node, &avc_cache.slots[i], list){
+			list_del_rcu(&node->list);
+			call_rcu(&node->rhead, avc_node_free);
 		}
-		avc_cache.slots[i] = NULL;
+		spin_unlock_irqrestore(&avc_cache.slots_lock[i], flag);
 	}
-	avc_cache.lru_hint = 0;
-
-	spin_unlock_irqrestore(&avc_lock,flags);
+	rcu_read_unlock();
 
 	for (i = 0; i < AVC_NSTATS; i++)
 		avc_cache_stats[i] = 0;
 
@@ -895,12 +934,9 @@
 				goto out;
 		}
 	}
 
-	spin_lock_irqsave(&avc_lock,flags);
-	if (seqno > avc_cache.latest_notif)
-		avc_cache.latest_notif = seqno;
-	spin_unlock_irqrestore(&avc_lock,flags);
+	avc_latest_notif_update(seqno, 0);
 out:
 	return rc;
 }
 
@@ -949,9 +985,9 @@
  * @ssid: source security identifier
  * @tsid: target security identifier
  * @tclass: target security class
  * @requested: requested permissions, interpreted based on @tclass
- * @aeref:  AVC entry reference
+ * @aeref:  AVC entry reference (unused)
  * @avd: access vector decisions
  *
  * Check the AVC to determine whether the @requested permissions are granted
  * for the SID pair (@ssid, @tsid), interpreting the permissions
@@ -968,72 +1004,46 @@
 int avc_has_perm_noaudit(u32 ssid, u32 tsid,
                          u16 tclass, u32 requested,
                          struct avc_entry_ref *aeref, struct av_decision *avd)
 {
-	struct avc_entry *ae;
-	int rc = 0;
-	unsigned long flags;
+	struct avc_node *node;
 	struct avc_entry entry;
+	int rc = 0;
 	u32 denied;
-	struct avc_entry_ref ref;
 
-	if (!aeref) {
-		avc_entry_ref_init(&ref);
-		aeref = &ref;
-	}
-
-	spin_lock_irqsave(&avc_lock, flags);
+	rcu_read_lock();
 	avc_cache_stats_incr(AVC_ENTRY_LOOKUPS);
-	ae = aeref->ae;
-	if (ae) {
-		if (ae->ssid == ssid &&
-		    ae->tsid == tsid &&
-		    ae->tclass == tclass &&
-		    ((ae->avd.decided & requested) == requested)) {
-			avc_cache_stats_incr(AVC_ENTRY_HITS);
-			ae->used = 1;
-		} else {
-			avc_cache_stats_incr(AVC_ENTRY_DISCARDS);
-			ae = NULL;
-		}
-	}
 
-	if (!ae) {
-		avc_cache_stats_incr(AVC_ENTRY_MISSES);
-		rc = avc_lookup(ssid, tsid, tclass, requested, aeref);
-		if (rc) {
-			spin_unlock_irqrestore(&avc_lock,flags);
-			rc = security_compute_av(ssid,tsid,tclass,requested,&entry.avd);
-			if (rc)
-				goto out;
-			spin_lock_irqsave(&avc_lock, flags);
-			rc = avc_insert(ssid,tsid,tclass,&entry,aeref);
-			if (rc) {
-				spin_unlock_irqrestore(&avc_lock,flags);
-				goto out;
-			}
+	node = avc_lookup(ssid, tsid, tclass, requested);
+	if (!node) {
+		rcu_read_unlock();
+		rc = security_compute_av(ssid,tsid,tclass,requested,&entry.avd);
+		if (rc)
+			goto out;
+		rcu_read_lock();
+		node = avc_insert(ssid,tsid,tclass,&entry);
+		if (!node) {
+			rc = -ENOMEM;
+			rcu_read_unlock();
+			goto out;
 		}
-		ae = aeref->ae;
 	}
-
 	if (avd)
-		memcpy(avd, &ae->avd, sizeof(*avd));
+		memcpy(avd, &node->ae.avd, sizeof(*avd));
 
-	denied = requested & ~(ae->avd.allowed);
+	denied = requested & ~(node->ae.avd.allowed);
 
 	if (!requested || denied) {
 		if (selinux_enforcing) {
-			spin_unlock_irqrestore(&avc_lock,flags);
 			rc = -EACCES;
-			goto out;
 		} else {
-			ae->avd.allowed |= requested;
-			spin_unlock_irqrestore(&avc_lock,flags);
-			goto out;
+			if (node->ae.avd.allowed != (node->ae.avd.allowed|requested))
+				avc_update_node(AVC_CALLBACK_GRANT
+				                ,requested,ssid,tsid,tclass);
 		}
 	}
 
-	spin_unlock_irqrestore(&avc_lock,flags);
+	rcu_read_unlock();
 out:
 	return rc;
 }
 
@@ -1042,9 +1052,9 @@
  * @ssid: source security identifier
  * @tsid: target security identifier
  * @tclass: target security class
  * @requested: requested permissions, interpreted based on @tclass
- * @aeref:  AVC entry reference
+ * @aeref:  AVC entry reference (unused)
  * @auditdata: auxiliary audit data
  *
  * Check the AVC to determine whether the @requested permissions are granted
  * for the SID pair (@ssid, @tsid), interpreting the permissions
@@ -1061,8 +1071,8 @@
 {
 	struct av_decision avd;
 	int rc;
 
-	rc = avc_has_perm_noaudit(ssid, tsid, tclass, requested, aeref, &avd);
+	rc = avc_has_perm_noaudit(ssid, tsid, tclass, requested, NULL, &avd);
 	avc_audit(ssid, tsid, tclass, requested, &avd, rc, auditdata);
 	return rc;
 }
diff -rNU4 linux-2.6.8.1/security/selinux/include/avc.h linux-2.6.8.1.rcu/security/selinux/include/avc.h
--- linux-2.6.8.1/security/selinux/include/avc.h	2004-08-14 19:54:51.000000000 +0900
+++ linux-2.6.8.1.rcu/security/selinux/include/avc.h	2004-08-20 20:40:50.000000000 +0900
@@ -118,13 +118,11 @@
  */
 
 void __init avc_init(void);
 
-int avc_lookup(u32 ssid, u32 tsid, u16 tclass,
-               u32 requested, struct avc_entry_ref *aeref);
+struct avc_node *avc_lookup(u32 ssid, u32 tsid, u16 tclass, u32 requested);
 
-int avc_insert(u32 ssid, u32 tsid, u16 tclass,
-               struct avc_entry *ae, struct avc_entry_ref *out_aeref);
+struct avc_node *avc_insert(u32 ssid, u32 tsid, u16 tclass, struct avc_entry *ae);
 
 void avc_audit(u32 ssid, u32 tsid,
                u16 tclass, u32 requested,
                struct av_decision *avd, int result, struct avc_audit_data *auditdata);

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-24 23:02         ` RCU issue with SELinux (Re: SELINUX performance issues) Paul E. McKenney
@ 2004-08-25  9:51           ` Kaigai Kohei
  2004-08-25 17:34             ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Kaigai Kohei @ 2004-08-25  9:51 UTC (permalink / raw)
  To: paulmck
  Cc: Stephen Smalley, SELinux-ML(Eng), Linux Kernel ML(Eng), James Morris

Hi Paul, thanks for your comments.

> > I modified the following points:
> > - We hold the lock for the hash bucket when avc_insert() and avc_ss_reset() are
> >   called, for safety.
> > - list_for_each_rcu() and list_entry() are replaced by list_for_each_entry().
> 
> One subtlety here...
> 
> The traversals that are protected by rcu_read_lock() (rather than an
> update-side spinlock) need to be list_for_each_entry_rcu() rather than
> list_for_each_entry().  The "_rcu()" is required in order to work
> reliably on Alpha, and has the added benefit of calling out exactly
> which traversals are RCU-protected.
> 
> Update-side code remains list_for_each_entry().

It was a simple misconception on my part.
I fixed them in the take3 patch.

> > - avc_node_dual structure which contains two avc_node objects is defined. 
> >   It allows to do avc_update_node() without kmalloc() or any locks.
> 
> What happens when you have two consecutive updates to the same object?
> Don't you have to defer the second update until a grace period has
> elapsed since the first update in order to avoid confusing readers that
> are still accessing the original version?

I hadn't imagined such a situation. Indeed, such a thing may happen.

> One way to do this would be to set a "don't-touch-me" bit that is
> cleared by an RCU callback.  An update to an element with the
> "don't-touch-me" bit set would block until the bit clears.  There
> are probably better ways...

I don't think we can apply this approach to the implementation
of avc_update_node(), because the execution context isn't permitted to block.

I changed my opinion and implementation of avc_update_node().
If kmalloc() returns NULL in avc_update_node(), it returns -ENOMEM.

But the effect of changing the prototype is limited, because only
avc_has_perm_noaudit() and avc_update_cache() call avc_update_node().

Even if avc_update_node() returns -ENOMEM to avc_has_perm_noaudit(),
avc_has_perm_noaudit() can ignore it, because the purpose there is only
to throttle the flood of audit logs.
The only adverse effect is that some audit logs are printed twice.

avc_update_cache() has no callers at present; it is only defined.

Some other trivial fixes are as follows:
- All list_for_each_entry() were replaced by list_for_each_entry_rcu().
- All spin_lock()/spin_unlock() were replaced by spin_lock_irqsave()
  /spin_unlock_irqrestore().
- In avc_node_insert(), if an entry with the same ssid/tsid/tclass as the new
  one exists, the older entry is replaced by the new one.
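
The replacement relies on the new list_replace_rcu() helper. Roughly, it
looks like this (a sketch only; the actual code is in the separately posted
list_replace_rcu-2.6.8.1.patch):

	static inline void list_replace_rcu(struct list_head *old,
	                                    struct list_head *new)
	{
		new->next = old->next;
		new->prev = old->prev;
		smp_wmb();	/* publish new's contents before linking it in */
		new->next->prev = new;
		new->prev->next = new;
	}

A reader traversing the bucket under rcu_read_lock() then sees either the
old entry or the new one, never a half-initialized node; the old node is
freed via call_rcu() after the grace period.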

Thank you for your comments as an RCU specialist!
--------
Kai Gai <kaigai@ak.jp.nec.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-24 13:24         ` James Morris
@ 2004-08-25  9:51           ` Kaigai Kohei
  2004-08-25 18:31             ` James Morris
  2004-08-25  9:52           ` [PATCH]atomic_inc_return() for i386/x86_64 (Re: RCU issue with SELinux) Kaigai Kohei
  1 sibling, 1 reply; 31+ messages in thread
From: Kaigai Kohei @ 2004-08-25  9:51 UTC (permalink / raw)
  To: James Morris; +Cc: Stephen Smalley, SELinux-ML(Eng), Linux Kernel ML(Eng)

Hi James, thanks for your comment.

> > o dbench [ 4 processes run in parallel on 4 CPUs / 10 trials ]
> >                   ---- mean ----  - STD -
> > 2.6.8.1(disable)  860.249 [MB/s]   44.683
> > 2.6.8.1(enable)   714.254 [MB/s]   32.359
> > 2.6.8.1(+rwlock)  767.904 [MB/s]   27.968
> > 2.6.8.1(+RCU)     830.678 [MB/s]   16.352
> 
> Can you show the figures for 1 and 2 clients?

The results are as follows:
o dbench [ 1/2/4 processes run in parallel on 4 CPUs / 10 trials ]
- Average[MB/s] -  -1proc- -2procs- -4procs-
2.6.8.1(disable)    247.43   473.11   891.51
2.6.8.1(enable)     218.99   434.97   761.06
2.6.8.1(+rwlock)    225.18   432.80   802.62
2.6.8.1(+RCU)       231.84   444.30   820.00
--------------------------------------------
(*) 2.6.8.1(+RCU) has the take3 patch applied.

By the way, the dbench results vary sharply from run to run.
I don't think the differences are statistically significant.

--------
Kai Gai <kaigai@ak.jp.nec.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH]atomic_inc_return() for i386/x86_64 (Re: RCU issue with SELinux)
  2004-08-24 13:24         ` James Morris
  2004-08-25  9:51           ` Kaigai Kohei
@ 2004-08-25  9:52           ` Kaigai Kohei
  1 sibling, 0 replies; 31+ messages in thread
From: Kaigai Kohei @ 2004-08-25  9:52 UTC (permalink / raw)
  To: James Morris; +Cc: Stephen Smalley, SELinux-ML(Eng), Linux Kernel ML(Eng)

[-- Attachment #1: Type: text/plain, Size: 963 bytes --]

Hi James, thanks for your comment.

> > In IA-32 or x86_64, can anybody implement atomic_inc_return()?
> > If it cannot be done, I'll try to make alternative macros or inline functions.
>
> If you can get this done, it will be very useful, as I could also run
> some benchmarks on my test systems.

atomic_inc_return() is not defined for the arm, arm26, i386, x86_64 and um architectures.
The attached patch adds atomic_inc_return() and atomic_dec_return() to arm, i386 and x86_64.

It is implemented with the 'xaddl' instruction with a LOCK prefix on i386 and x86_64.
However, this instruction is only available on i486 and later processors,
so another implementation may be necessary for the i386SX/DX.
That said, 'xaddl' is already used unconditionally in 'include/asm-i386/rwsem.h',
so I assume its use was agreed upon some time ago.
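
For what it's worth, a possible fallback for 386-only kernels would be a
read-modify-write with interrupts disabled (an untested sketch, assuming a
CONFIG_M386-style guard; Linux does not support SMP on the 386, so disabling
interrupts is sufficient there):

#ifdef CONFIG_M386
/* No 'xaddl' on a genuine 386; UP-only, so an irq-off RMW is atomic enough. */
static __inline__ int atomic_add_return(int i, atomic_t *v)
{
	unsigned long flags;
	int __i;

	local_irq_save(flags);
	__i = atomic_read(v);
	atomic_set(v, __i + i);
	local_irq_restore(flags);
	return __i + i;
}
#endif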

Is this acceptable? Comments are welcome.

(*) The implementation for ARM is a simple wrapper around atomic_add_return().
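
As a usage example, the SELinux patch relies on these helpers like this
(fragments from the patched avc.c):

	/* round-robin choice of the bucket where reclaim starts */
	hvalue = atomic_inc_return(&avc_cache.lru_hint) % AVC_CACHE_SLOTS;

	/* trigger reclaim once the cache grows past the threshold */
	actives = atomic_inc_return(&avc_cache.active_nodes);
	if (actives > AVC_CACHE_THRESHOLD)
		avc_reclaim_node();
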
--------
Kai Gai <kaigai@ak.jp.nec.com>

[-- Attachment #2: atomic_inc_return-2.6.8.1.patch --]
[-- Type: application/octet-stream, Size: 2739 bytes --]

--- linux-2.6.8.1/include/asm-arm/atomic.h	2004-08-14 19:54:50.000000000 +0900
+++ linux-2.6.8.1.rcu/include/asm-arm/atomic.h	2004-08-24 19:31:56.000000000 +0900
@@ -194,8 +194,10 @@
 #define atomic_dec(v)		atomic_sub(1, v)
 
 #define atomic_inc_and_test(v)	(atomic_add_return(1, v) == 0)
 #define atomic_dec_and_test(v)	(atomic_sub_return(1, v) == 0)
+#define atomic_inc_return(v)    (atomic_add_return(1, v))
+#define atomic_dec_return(v)    (atomic_sub_return(1, v))
 
 #define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0)
 
 /* Atomic operations are already serializing on ARM */
--- linux-2.6.8.1/include/asm-i386/atomic.h	2004-08-14 19:55:09.000000000 +0900
+++ linux-2.6.8.1.rcu/include/asm-i386/atomic.h	2004-08-25 11:57:08.000000000 +0900
@@ -175,8 +175,34 @@
 		:"ir" (i), "m" (v->counter) : "memory");
 	return c;
 }
 
+/**
+ * atomic_add_return - add and return
+ * @v: pointer of type atomic_t
+ * @i: integer value to add
+ *
+ * Atomically adds @i to @v and returns @i + @v
+ */
+static __inline__ int atomic_add_return(int i, atomic_t *v)
+{
+	/* NOTICE: This function can be used on i486+ */
+	int __i = i;
+	__asm__ __volatile__(
+		LOCK "xaddl %0, %1;"
+		:"=r"(i)
+		:"m"(v->counter), "0"(i));
+	return i + __i;
+}
+
+static __inline__ int atomic_sub_return(int i, atomic_t *v)
+{
+	return atomic_add_return(-i,v);
+}
+
+#define atomic_inc_return(v)  (atomic_add_return(1,v))
+#define atomic_dec_return(v)  (atomic_sub_return(1,v))
+
 /* These are x86-specific, used by some header files */
 #define atomic_clear_mask(mask, addr) \
 __asm__ __volatile__(LOCK "andl %0,%1" \
 : : "r" (~(mask)),"m" (*addr) : "memory")
--- linux-2.6.8.1/include/asm-x86_64/atomic.h	2004-08-14 19:56:23.000000000 +0900
+++ linux-2.6.8.1.rcu/include/asm-x86_64/atomic.h	2004-08-25 11:57:36.000000000 +0900
@@ -177,8 +177,33 @@
 		:"ir" (i), "m" (v->counter) : "memory");
 	return c;
 }
 
+/**
+ * atomic_add_return - add and return
+ * @v: pointer of type atomic_t
+ * @i: integer value to add
+ *
+ * Atomically adds @i to @v and returns @i + @v
+ */
+static __inline__ int atomic_add_return(int i, atomic_t *v)
+{
+	int __i = i;
+	__asm__ __volatile__(
+		LOCK "xaddl %0, %1;"
+		:"=r"(i)
+		:"m"(v->counter), "0"(i));
+	return i + __i;
+}
+
+static __inline__ int atomic_sub_return(int i, atomic_t *v)
+{
+	return atomic_add_return(-i,v);
+}
+
+#define atomic_inc_return(v)  (atomic_add_return(1,v))
+#define atomic_dec_return(v)  (atomic_sub_return(1,v))
+
 /* These are x86-specific, used by some header files */
 #define atomic_clear_mask(mask, addr) \
 __asm__ __volatile__(LOCK "andl %0,%1" \
 : : "r" (~(mask)),"m" (*addr) : "memory")

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-25  9:51           ` Kaigai Kohei
@ 2004-08-25 15:50             ` Stephen Smalley
  2004-08-25 16:11               ` Stephen Smalley
  2004-08-26  7:53               ` Kaigai Kohei
  0 siblings, 2 replies; 31+ messages in thread
From: Stephen Smalley @ 2004-08-25 15:50 UTC (permalink / raw)
  To: Kaigai Kohei; +Cc: SELinux-ML(Eng), Linux Kernel ML(Eng), James Morris

On Wed, 2004-08-25 at 05:51, Kaigai Kohei wrote:
> The attached take3-patch is modified as follows:
> - avc_node_dual was eliminated per Paul E. McKenney's suggestion.
>   avc_update_node() calls kmalloc() and may return -ENOMEM.
>   (But I think this effect is quite limited.)
> - All list_for_each_entry() were replaced by list_for_each_entry_rcu().
> - All spin_lock()/spin_unlock() were replaced by spin_lock_irqsave()
>   /spin_unlock_irqrestore().
> - In avc_node_insert(), if an entry with the same ssid/tsid/tclass as the new
>   one exists, the older entry is replaced by the new one.
> 
> Thanks. Hopefully this can be the final version. :)

I haven't tracked down the cause yet, but a kernel built with all three
patches (list_replace_rcu, atomic_inc_return, and selinux.rcu take3) on
x86 doesn't allow an enforcing boot; it begins auditing denials _before_
the initial policy load (which should never happen, as
security_compute_av allows everything until the policy is loaded), and
prevents /sbin/init from loading the policy.

A few other comments on the patch:

+	new = kmalloc(sizeof(struct avc_node), GFP_ATOMIC);
+	if (!new)
+		return NULL;

Dynamically allocating the nodes at runtime (rather than pre-allocating
them and then just reclaiming them as necessary as in the current AVC)
worries me, as it introduces a new failure case for avc_has_perm. 
Denying permission to a resource due to transient memory shortage is not
good for robustness.  And changing the GFP_ATOMIC is not an option, as
calling context may not allow blocking.  Hence, pre-allocation seems
desirable, regardless of the locking scheme.

+static int avc_latest_notif_update(int seqno, int is_insert)
+{
+	int ret = 0;
+	static spinlock_t notif_lock = SPIN_LOCK_UNLOCKED;
+	unsigned long flag;
+
+	spin_lock_irqsave(&notif_lock, flag);
+	if (seqno < avc_cache.latest_notif) {
+		if (is_insert) {
+			printk(KERN_WARNING "avc:  seqno %d < latest_notif %d\n",
+			       seqno, avc_cache.latest_notif);
+			ret = -EAGAIN;
+		} else {
+			avc_cache.latest_notif = seqno;
+		}
+	}
+	spin_unlock_irqrestore(&notif_lock, flag);
+	return ret;
 }

In trying to merge the logic related to latest_notif, you've introduced
a bug - latest_notif should only be increased, never decreased.  See the
original logic from avc_control and avc_ss_reset prior to your patch. 
Those functions update the latest notif based on a policy change event. 
In the insert case, you are checking that the entry is not stale, i.e.
has a smaller seqno than the latest notification due to an interleaving
with a policy change event.
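
Roughly, the helper needs to keep the two cases separate, e.g. (sketch only):

	spin_lock_irqsave(&notif_lock, flag);
	if (is_insert) {
		/* insert: reject entries staler than the latest notification */
		if (seqno < avc_cache.latest_notif)
			ret = -EAGAIN;
	} else {
		/* revocation/reset: latest_notif may only move forward */
		if (seqno > avc_cache.latest_notif)
			avc_cache.latest_notif = seqno;
	}
	spin_unlock_irqrestore(&notif_lock, flag);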
 
+			if (node->ae.avd.allowed != (node->ae.avd.allowed|requested))
+				avc_update_node(AVC_CALLBACK_GRANT
+				                ,requested,ssid,tsid,tclass);
 		}

The test seems unnecessary, as the function has already determined that
not all of the requested permissions were granted, so you should be able
to just unconditionally call avc_update_node here, and you only need to
pass it the denied set that has already been computed, as any other
permissions in requested were already allowed.
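
I.e., the tail of avc_has_perm_noaudit() could simply be (sketch):

	denied = requested & ~(node->ae.avd.allowed);
	if (!requested || denied) {
		if (selinux_enforcing)
			rc = -EACCES;
		else
			/* grant only the bits that were actually denied */
			avc_update_node(AVC_CALLBACK_GRANT, denied,
			                ssid, tsid, tclass);
	}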

-- 
Stephen Smalley <sds@epoch.ncsc.mil>
National Security Agency


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-25 15:50             ` Stephen Smalley
@ 2004-08-25 16:11               ` Stephen Smalley
  2004-08-26  7:53               ` Kaigai Kohei
  1 sibling, 0 replies; 31+ messages in thread
From: Stephen Smalley @ 2004-08-25 16:11 UTC (permalink / raw)
  To: Kaigai Kohei; +Cc: SELinux-ML(Eng), Linux Kernel ML(Eng), James Morris

On Wed, 2004-08-25 at 11:50, Stephen Smalley wrote:
> I haven't tracked down the cause yet, but a kernel built with all three
> patches (list_replace_rcu, atomic_inc_return, and selinux.rcu take3) on
> x86 doesn't allow an enforcing boot; it begins auditing denials _before_
> the initial policy load (which should never happen, as
> security_compute_av allows everything until the policy is loaded), and
> prevents /sbin/init from loading the policy.

-int avc_lookup(u32 ssid, u32 tsid, u16 tclass,
-               u32 requested, struct avc_entry_ref *aeref)
+struct avc_node *avc_lookup(u32 ssid, u32 tsid, u16 tclass, u32 requested)
 {
 	struct avc_node *node;
-	int probes, rc = 0;
+	int probes;
 
 	avc_cache_stats_incr(AVC_CAV_LOOKUPS);
 	node = avc_search_node(ssid, tsid, tclass,&probes);
 
 	if (node && ((node->ae.avd.decided & requested) == requested)) {
 		avc_cache_stats_incr(AVC_CAV_HITS);
 		avc_cache_stats_add(AVC_CAV_PROBES,probes);
-		aeref->ae = &node->ae;
 		goto out;
 	}
 
 	avc_cache_stats_incr(AVC_CAV_MISSES);
-	rc = -ENOENT;
 out:
-	return rc;
+	return node;
+}

Ah, I think the bug is here.  avc_search_node() can return a node that
matches the (ssid,tsid,tclass) triple but doesn't include all of the
necessary decisions (the decided vector), but your avc_lookup() code
falls through nonetheless and returns it.  This happens normally during
initialization prior to policy load, where security_compute_av is only
adding decisions as they are requested.  Notice that the original code
set rc to -ENOENT on that path; in your case, you need to reset node to
NULL.
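
That is, the miss path should clear node before returning, e.g. (sketch):

	/* cache miss: don't hand back a node whose decided vector
	 * lacks some of the requested bits */
	node = NULL;
	avc_cache_stats_incr(AVC_CAV_MISSES);
out:
	return node;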

-- 
Stephen Smalley <sds@epoch.ncsc.mil>
National Security Agency


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-25  9:51           ` Kaigai Kohei
@ 2004-08-25 17:34             ` Paul E. McKenney
  0 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2004-08-25 17:34 UTC (permalink / raw)
  To: Kaigai Kohei
  Cc: Stephen Smalley, SELinux-ML(Eng), Linux Kernel ML(Eng), James Morris

On Wed, Aug 25, 2004 at 06:51:53PM +0900, Kaigai Kohei wrote:

Hello again, Kai,

> Hi Paul, thanks for your comments.
> 
> > > I modified the following points:
> > > - We hold the lock for the hash bucket when avc_insert() and avc_ss_reset() are
> > >   called, for safety.
> > > - list_for_each_rcu() and list_entry() are replaced by list_for_each_entry().
> > 
> > One subtlety here...
> > 
> > The traversals that are protected by rcu_read_lock() (rather than an
> > update-side spinlock) need to be list_for_each_entry_rcu() rather than
> > list_for_each_entry().  The "_rcu()" is required in order to work
> > reliably on Alpha, and has the added benefit of calling out exactly
> > which traversals are RCU-protected.
> > 
> > Update-side code remains list_for_each_entry().
> 
> It was a simple misconception on my part.
> I fixed them in the take3 patch.

Let's take them one at a time:

1.	The list_for_each_entry_rcu() in avc_hash_eval() is correct.
	It is protected by rcu_read_lock(), so it is documenting
	a RCU-protected traversal of the list, and inserting needed
	memory barriers on Alpha CPUs.

2.	The list_for_each_entry_rcu() in avc_reclaim_node() should
	instead be list_for_each_entry(), with no _rcu().  This is
	because this traversal is guarded by the update-side lock,
	not by RCU.  The fact that the lock is held means that no
	other CPU can be changing this list during the traversal.

3.	The list_for_each_entry_rcu() in avc_search_node() is correct.
	As in item #1, it is protected by rcu_read_lock().

4.	The list_for_each_entry_rcu() in avc_insert() is more interesting,
	but needs to be list_for_each_entry().  It is protected by -both-
	the spinlock and by an rcu_read_lock() acquired by the caller!

	However, the spinlock prevents any other CPU from modifying the
	data structure, so this should be list_for_each_entry() rather
	than the current list_for_each_entry_rcu().

	What is happening is that the code is upgrading from a read-side
	rcu_read_lock() to an update-side spin_lock_irqsave().  This
	is legal, though perhaps a bit unconventional.

	However, looking closer, I don't understand why the rcu_read_lock()
	needs to be re-acquired before the call to avc_insert() in
	avc_has_perm_noaudit():

	o	avc_latest_notif_update() allocates a sequence number
		under a lock.

	o	avc_get_node() allocates and initializes a node.  It does
		call avc_reclaim_node(), but this function is guarded
		by a spinlock.

	o	avc_insert() calls the above two functions, and other
		than that, initializes the new node (which no other CPU
		has access to), and inserts it under the spinlock.

	So I do not believe that avc_insert() needs rcu_read_lock().
	Unless I am missing something, the rcu_read_lock() acquired
	in avc_has_perm_noaudit() should be moved after the call to
	avc_insert().

	I was concerned about the possibility that avc_has_perm_noaudit()
	might be preempted between the rcu_read_unlock() and the
	rcu_read_lock(), but this issue appears to be correctly handled
	by the code in avc_insert(), which handles either case, updating
	the existing node or inserting a new one.

5.	The list_for_each_entry_rcu() in avc_update_node() should be
	changed to list_for_each_entry(), since it is protected
	by the update-side spinlock, and therefore does not need
	RCU protection, as in item #2 above.  This is another case
	where an rcu_read_lock() is "upgraded" to a write-side lock
	by acquiring a spinlock.

6.	The list_for_each_entry_rcu() in avc_update_cache() is correct.
	It is protected only by rcu_read_lock(), so RCU protection is
	required.

7.	The list_for_each_entry_rcu() in avc_ss_reset() needs to be
	changed to list_for_each_entry(), since it is protected by
	the update-side spinlock.  Since the array is statically allocated,
	there is no need for the enclosing rcu_read_lock() and
	rcu_read_unlock(), and they should be removed.
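
To put the rule in miniature (a generic sketch, not part of the patch below):

	/* Read side: protected only by rcu_read_lock(), so the
	 * _rcu() traversal primitive is required. */
	rcu_read_lock();
	list_for_each_entry_rcu(node, head, list)
		do_lookup(node);
	rcu_read_unlock();

	/* Update side: the bucket spinlock excludes other writers, so
	 * plain list_for_each_entry() suffices -- but the list mutation
	 * primitives stay _rcu() so concurrent readers see a sane list. */
	spin_lock_irqsave(&lock, flags);
	list_for_each_entry(node, head, list) {
		if (match(node)) {
			list_del_rcu(&node->list);
			call_rcu(&node->rhead, node_free);
			break;
		}
	}
	spin_unlock_irqrestore(&lock, flags);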

I clearly need to provide more documentation.  ;-)  This is in progress.

See below for a patch that applies on top of your take3 patch, since
the C code might be more clear than the English in this case.

> > > - avc_node_dual structure which contains two avc_node objects is defined. 
> > >   It allows to do avc_update_node() without kmalloc() or any locks.
> > 
> > What happens when you have two consecutive updates to the same object?
> > Don't you have to defer the second update until a grace period has
> > elapsed since the first update in order to avoid confusing readers that
> > are still accessing the original version?
> 
> I hadn't imagined such a situation. Indeed, such a thing may happen.
> 
> > One way to do this would be to set a "don't-touch-me" bit that is
> > cleared by an RCU callback.  An update to an element with the
> > "don't-touch-me" bit set would block until the bit clears.  There
> > are probably better ways...
> 
> I don't think we can apply this approach to the implementation
> of avc_update_node(), because the execution context isn't permitted to block.

That would certainly invalidate my suggestion!  ;-)

> I changed my opinion and implementation of avc_update_node().
> If kmalloc() returns NULL in avc_update_node(), it returns -ENOMEM.
> 
> But the effect of changing the prototype is limited, because only
> avc_has_perm_noaudit() and avc_update_cache() call avc_update_node().
> 
> Even if avc_update_node() returns -ENOMEM to avc_has_perm_noaudit(),
> avc_has_perm_noaudit() can ignore it, because the purpose there is only
> to throttle the flood of audit logs.
> The only adverse effect is that some audit logs are printed twice.
> 
> avc_update_cache() has no callers at present; it is only defined.

Sounds good!

> Some other trivial fixes are as follows:
> - All list_for_each_entry() were replaced by list_for_each_entry_rcu().

See above.  ;-)

> - All spin_lock()/spin_unlock() were replaced by spin_lock_irqsave()
>   /spin_unlock_irqrestore().
> - In avc_node_insert(), if an entry with the same ssid/tsid/tclass as the new
>   one exists, the older entry is replaced by the new one.
> 
> Thank you for your comments as an RCU specialist!

Thank -you- for giving RCU a try!  Again, the performance results are
quite impressive!

						Thanx, Paul


diff -urpN -X dontdiff linux-2.6.8.1-selinux3.rcu/security/selinux/avc.c linux-2.6.8.1-selinux3.rcu.pem/security/selinux/avc.c
--- linux-2.6.8.1-selinux3.rcu/security/selinux/avc.c	Wed Aug 25 09:34:26 2004
+++ linux-2.6.8.1-selinux3.rcu.pem/security/selinux/avc.c	Wed Aug 25 10:30:10 2004
@@ -253,7 +253,7 @@ static inline int avc_reclaim_node(void)
 		if (!spin_trylock(&avc_cache.slots_lock[hvalue]))
 			continue;
 
-		list_for_each_entry_rcu(node, &avc_cache.slots[hvalue], list) {
+		list_for_each_entry(node, &avc_cache.slots[hvalue], list) {
 			if (!atomic_dec_and_test(&node->ae.used)) {
 				/* Recently Unused */
 				list_del_rcu(&node->list);
@@ -419,7 +419,7 @@ struct avc_node *avc_insert(u32 ssid, u3
 		node->ae.avd.seqno = ae->avd.seqno;
 
 		spin_lock_irqsave(&avc_cache.slots_lock[hvalue], flag);
-		list_for_each_entry_rcu(pos, &avc_cache.slots[hvalue], list) {
+		list_for_each_entry(pos, &avc_cache.slots[hvalue], list) {
 			if (pos->ae.ssid == ssid &&
 			    pos->ae.tsid == tsid &&
 			    pos->ae.tclass == tclass) {
@@ -719,7 +719,7 @@ static int avc_update_node(u32 event, u3
 	hvalue = avc_hash(ssid, tsid, tclass);
 	spin_lock_irqsave(&avc_cache.slots_lock[hvalue], flag);
 
-	list_for_each_entry_rcu(pos, &avc_cache.slots[hvalue], list){
+	list_for_each_entry(pos, &avc_cache.slots[hvalue], list){
 		if ( ssid==pos->ae.ssid &&
 		     tsid==pos->ae.tsid &&
 		     tclass==pos->ae.tclass ){
@@ -912,16 +912,14 @@ int avc_ss_reset(u32 seqno)
 
 	avc_hash_eval("reset");
 
-	rcu_read_lock();
 	for (i = 0; i < AVC_CACHE_SLOTS; i++) {
 		spin_lock_irqsave(&avc_cache.slots_lock[i], flag);
-		list_for_each_entry_rcu(node, &avc_cache.slots[i], list){
+		list_for_each_entry(node, &avc_cache.slots[i], list){
 			list_del_rcu(&node->list);
 			call_rcu(&node->rhead, avc_node_free);
 		}
 		spin_unlock_irqrestore(&avc_cache.slots_lock[i], flag);
 	}
-	rcu_read_unlock();
 
 	for (i = 0; i < AVC_NSTATS; i++)
 		avc_cache_stats[i] = 0;

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-25  9:51           ` Kaigai Kohei
@ 2004-08-25 18:31             ` James Morris
  0 siblings, 0 replies; 31+ messages in thread
From: James Morris @ 2004-08-25 18:31 UTC (permalink / raw)
  To: Kaigai Kohei; +Cc: Stephen Smalley, SELinux-ML(Eng), Linux Kernel ML(Eng)

On Wed, 25 Aug 2004, Kaigai Kohei wrote:

> By the way, the dbench results vary sharply from run to run.
> I don't think the differences are statistically significant.

Try running dbench on an ext2 filesystem (including both the 'client' 
file and where you run it locally).


- James
-- 
James Morris
<jmorris@redhat.com>



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-25 15:50             ` Stephen Smalley
  2004-08-25 16:11               ` Stephen Smalley
@ 2004-08-26  7:53               ` Kaigai Kohei
  2004-08-26 13:24                 ` Stephen Smalley
  1 sibling, 1 reply; 31+ messages in thread
From: Kaigai Kohei @ 2004-08-26  7:53 UTC (permalink / raw)
  To: Stephen Smalley; +Cc: SELinux-ML(Eng), Linux Kernel ML(Eng), James Morris

Hi Stephen, thanks for your comments.

> A few other comments on the patch:
> 
> +	new = kmalloc(sizeof(struct avc_node), GFP_ATOMIC);
> +	if (!new)
> +		return NULL;
> 
> Dynamically allocating the nodes at runtime (rather than pre-allocating
> them and then just reclaiming them as necessary as in the current AVC)
> worries me, as it introduces a new failure case for avc_has_perm. 
> Denying permission to a resource due to transient memory shortage is not
> good for robustness.  And changing the GFP_ATOMIC is not an option, as
> calling context may not allow blocking.  Hence, pre-allocation seems
> desirable, regardless of the locking scheme.

In my understanding, your worry about robustness concerns the execution path
taken when kmalloc() returns NULL.
That is the most significant point.
--[ in the original kernel-2.6.8.1 ]--
    :
  rc = avc_insert(ssid,tsid,tclass,&entry,aeref);
  if (rc) {
    spin_unlock_irqrestore(&avc_lock,flags);
    goto out;
  }
    :
--------------------------------------
In the original implementation, avc_insert() returns -ENOMEM when it can't
obtain an avc_entry from avc_claim_node(), and avc_has_perm_noaudit() then
denies the requested permission check because no avc_node could be allocated.
(But avc_insert() always returns 0, because it reclaims an avc_node
 under the spinlock when the freelist is empty.)
However, allocating an avc_node and making the decision according to
the SELinux policy should be considered separately.

In my RCU approach, avc_insert() can actually return NULL.
Currently, avc_has_perm_noaudit() returns -ENOMEM without making a decision,
just as in the original implementation.
But with suitable recovery processing, we can still make a decision according
to the SELinux policy.

My proposal is as follows:
- Normally, the decision is made against an entry in the AVC.
  These entries are allocated by kmalloc().
- When the kmalloc() called from avc_insert() returns NULL,
  the decision is made against an entry in a stack variable,
  and that entry is not kept in the AVC.

-- e.g. in avc_has_perm_noaudit() --------------
    :
  struct avc_entry local_ae, *ae;
    :
  node = avc_insert(....);
  if (!node) {
    /* kmalloc() failed: fall back to a stack-local entry.
     * SETUP_LOCAL_AE is a proposed helper that would fill
     * local_ae from the access vectors in entry.avd. */
    SETUP_LOCAL_AE(&local_ae, &entry.avd);
    ae = &local_ae;
  } else {
    ae = &node->ae;
  }

  if (avd)
    memcpy(avd, &ae->avd, sizeof(*avd));
    :
------------------------------------------------
By this method, the decision-making is available irrespective of
the result of kmalloc(). Does that address the robustness concern?
The original implementation has too much lock contention in big-SMP
environments, so it is worth pursuing the RCU-based method.

> In trying to merge the logic related to latest_notif, you've introduced
> a bug - latest_notif should only be increased, never decreased.  See the
> original logic from avc_control and avc_ss_reset prior to your patch. 
> Those functions update the latest notif based on a policy change event. 
> In the insert case, you are checking that the entry is not stale, i.e.
> has a smaller seqno than the latest notification due to an interleaving
> with a policy change event.

Ooooooops!
What a trivial bug!
I'll fix it very soon.

> +	if (node->ae.avd.allowed != (node->ae.avd.allowed|requested))
> +		avc_update_node(AVC_CALLBACK_GRANT
> +		                ,requested,ssid,tsid,tclass);
> 	}
> 
> The test seems unnecessary, as the function has already determined that
> not all of the requested permissions were granted, so you should be able
> to just unconditionally call avc_update_node here, and you only need to
> pass it the denied set that has already been computed, as any other
> permissions in requested were already allowed.

Indeed, you are right. I'll fix it soon.
avc_update_node() is called inside the section between rcu_read_lock() and rcu_read_unlock().

> -	rc = -ENOENT;
>  out:
> -	return rc;
> +	return node;
> +}
> 
> Ah, I think the bug is here.

Oops!
I agree, and will fix this soon.

Please wait for a patch, thanks.
--------
Kai Gai <kaigai@ak.jp.nec.com>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-26  7:53               ` Kaigai Kohei
@ 2004-08-26 13:24                 ` Stephen Smalley
  2004-08-27 11:07                   ` Kaigai Kohei
  2004-08-30 11:17                   ` [PATCH]SELinux performance improvement by RCU (Re: RCU issue with SELinux) Kaigai Kohei
  0 siblings, 2 replies; 31+ messages in thread
From: Stephen Smalley @ 2004-08-26 13:24 UTC (permalink / raw)
  To: Kaigai Kohei; +Cc: SELinux-ML(Eng), Linux Kernel ML(Eng), James Morris

On Thu, 2004-08-26 at 03:53, Kaigai Kohei wrote:
> In my understanding, your worry about robustness concerns the execution path
> taken when kmalloc() returns NULL.

Correct.

> (But avc_insert() always returns 0, because it reclaims an avc_node
>  under the spinlock when the freelist is empty.)

Yes, this is the point.  avc_has_perm could not fail previously from an
out of memory condition, as the cache nodes were preallocated,
maintained on their own freelist, and reclaimed as needed.

> By this method, the decision-making is available irrespective of
> the result of kmalloc(). Does that address the robustness concern?
> The original implementation has too much lock contention in big-SMP
> environments, so it is worth pursuing the RCU-based method.

Yes, that would address my concern.  However, I'm still unclear as to
why using RCU mandates that we migrate from preallocated nodes to
dynamic allocation.  I certainly agree that the existing global spinlock
doesn't scale.  

> Please wait for a patch, thanks.

Thanks for working on this.  Could you also supply updated performance
data when you have a newer patch?  Thanks.

-- 
Stephen Smalley <sds@epoch.ncsc.mil>
National Security Agency


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: RCU issue with SELinux (Re: SELINUX performance issues)
  2004-08-26 13:24                 ` Stephen Smalley
@ 2004-08-27 11:07                   ` Kaigai Kohei
  2004-08-30 11:17                   ` [PATCH]SELinux performance improvement by RCU (Re: RCU issue with SELinux) Kaigai Kohei
  1 sibling, 0 replies; 31+ messages in thread
From: Kaigai Kohei @ 2004-08-27 11:07 UTC (permalink / raw)
  To: Stephen Smalley; +Cc: SELinux-ML(Eng), Linux Kernel ML(Eng), James Morris

Hi Stephen, thanks for your comments.

> > By this method, the decision-making is available irrespective of
> > the result of kmalloc(). Is it robustless?
> > The original implementation has too many lock contensitons in Big-SMP
> > environment. It is more positive to consider the method using RCU.
> 
> Yes, that would address my concern.  However, I'm still unclear as to
> why using RCU mandates that we migrate from preallocated nodes to
> dynamic allocation.  I certainly agree that the existing global spinlock
> doesn't scale.  

When avc_reclaim_node() is called, one or more nodes are reclaimed
in both approaches (spinlock and RCU).

But in the RCU approach we can't reuse the reclaimed nodes immediately,
because we can't guarantee that nobody still refers to them.
(The nodes are actually released only after a non-deterministic grace period.)
The success of avc_reclaim_node() does not mean that we can obtain
an avc_node object immediately!
Therefore, we need to allocate a new avc_node object with kmalloc().

In the original spinlock implementation, the reclaimed node is chained
onto 'avc_node_freelist' under the spinlock, so we can reuse
the reclaimed node immediately.

Indeed, I had considered an RCU approach with pre-allocation,
but I ran into the difficult problem above and gave up.
Allocating with kmalloc() instead works fine.
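
A sketch of the problematic interleaving, to make this concrete:

	/* reader, on some CPU */
	rcu_read_lock();
	node = avc_search_node(ssid, tsid, tclass, &probes);
	/* ... meanwhile another CPU does list_del_rcu(&node->list).
	 * If that node went straight back onto a freelist and were
	 * reused right now, this dereference could see recycled data. */
	allowed = node->ae.avd.allowed;
	rcu_read_unlock();

	/* so the reclaimer must defer the actual free */
	list_del_rcu(&node->list);
	call_rcu(&node->rhead, avc_node_free);	/* runs after a grace period */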

> > Please wait for a patch, thanks.
> 
> Thanks for working on this.  Could you also supply updated performance
> data when you have a newer patch?  Thanks.

OK, the benchmark results will also be updated.
Thanks.
--------
Kai Gai <kaigai@ak.jp.nec.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH]SELinux performance improvement by RCU (Re: RCU issue with SELinux)
  2004-08-26 13:24                 ` Stephen Smalley
  2004-08-27 11:07                   ` Kaigai Kohei
@ 2004-08-30 11:17                   ` Kaigai Kohei
  2004-08-30 15:35                     ` Stephen Smalley
  2004-08-31 15:33                     ` James Morris
  1 sibling, 2 replies; 31+ messages in thread
From: Kaigai Kohei @ 2004-08-30 11:17 UTC (permalink / raw)
  To: Stephen Smalley; +Cc: SELinux-ML(Eng), Linux Kernel ML(Eng), James Morris

[-- Attachment #1: Type: text/plain, Size: 4345 bytes --]

Hi Stephen, thanks for your comments.

> > Please wait for a patch, thanks.
> 
> Thanks for working on this.  Could you also supply updated performance
> data when you have a newer patch?  Thanks.

I fixed the take-3 patch according to your and Paul's suggestions.

The attached take-4 patches replace the avc_lock in security/selinux/avc.c
with lock-less read access based on RCU.

The patches consist of three parts:
[1] selinux.rcu-2.6.8.1-take4.patch
     removes the spinlock (avc_lock) from security/selinux/avc.c
     and applies the RCU approach for lock-less read access to the AVC.
[2] list_replace_rcu-2.6.8.1.patch
     adds list_replace_rcu() to include/linux/list.h for atomic replace
     operations under the RCU model.
[3] atomic_inc_return-2.6.8.1.patch
     adds atomic_inc_return() and atomic_dec_return() to the i386, x86_64 and
     ARM architectures; these are needed by patch [1].


[Changes from the original implementation]
- Some of the data structures are redefined
  (struct avc_entry, struct avc_node, struct avc_cache).
- The direct references from the security-private members of inode, task_struct
  and so on were removed, since these references prevent applying RCU.
  Currently, avc_entry_ref is unused.
- The pre-allocation of avc_node structures was replaced by kmalloc()
  on demand, since this is necessary for applying RCU.
  (The number of nodes is regulated by AVC_CACHE_THRESHOLD, but it is not a hard limit.)
- avc_update_node() can return -ENOMEM when kmalloc() returns NULL.
  This is not a significant problem, since the function is actually called
  only by avc_has_perm_noaudit(), to throttle the flood of audit logs.

Signed-off-by: KaiGai, Kohei <kaigai@ak.jp.nec.com>
Signed-off-by: Takayoshi Kochi <t-kochi@bq.jp.nec.com>

Thanks for all the comments and suggestions.


[Performance & Scalability results] --------------------------------------

2.6.8.1(disable/enable) : Original Stock Kernel
2.6.8.1.rwlock          : Improvement by rwlock (patch not disclosed)
2.6.8.1.rcu             : Improvement by RCU (attached take-4 patch)

o 500,000 write() syscalls to a file on /dev/shm
 on 1/2/4/8/16/32 Itanium2 CPUs, with enough memory that no swap-out occurs
 *) The number of processes equals the number of CPUs.
---------------- --1CPU--  --2CPU-- --4CPU-- -- 8CPU-- --16CPU-- --32CPU--
2.6.8.1(disable)   8.2959    8.0430   8.0158    8.0183    8.0146    8.0037
2.6.8.1(enable)   11.8427   14.0358  78.0957  319.0451 1322.0313  too long
2.6.8.1.rwlock    11.2485   13.8688  20.0100   49.0390   90.0200  177.0533
2.6.8.1.rcu       11.3754   11.2028  11.1743   11.1402   11.1216   11.2635
---------------- --------  -------- -------- --------- --------- ---------

o dbench (mean of 10 runs)
- Average[MB/s] -  -1proc- -2procs- -4procs-
2.6.8.1(disable)    247.43   473.11   891.51
2.6.8.1(enable)     218.99   434.97   761.06
2.6.8.1(+rwlock)    225.18   432.80   802.62
2.6.8.1(+RCU)       231.16   447.62   838.15
--------------------------------------------

o UNIX-bench

* INDEX value comparison ---------------------------------------------------
                                       2.6.8.1   2.6.8.1   2.6.8.1   2.6.8.1
                                     (Disable)  (Enable)  (rwlock)    (RCU) 
Dhrystone 2 using register variables    268.9     268.8     269.2     268.7 
Double-Precision Whetstone               94.2      94.2      94.2      94.2 
Execl Throughput                        388.3     379.0     377.8     377.0 
File Copy 1024 bufsize 2000 maxblocks   606.6     526.6     515.6     513.1 
File Copy 256 bufsize 500 maxblocks     508.9     417.0     410.4     402.9 
File Copy 4096 bufsize 8000 maxblocks   987.1     890.4     876.0     864.1 
Pipe Throughput                         525.1     406.4     404.5     356.2 
Process Creation                        321.2     317.8     315.9     316.5 
Shell Scripts (8 concurrent)           1312.8    1276.2    1278.8    1279.5 
System Call Overhead                    467.1     468.7     464.1     468.1 
                                    ========================================
     FINAL SCORE                        445.8     413.2     410.1     403.8 
----------------------------------------------------------------------------

--------
Kai Gai <kaigai@ak.jp.nec.com>


[-- Attachment #2: selinux.rcu-2.6.8.1-take4.patch --]
[-- Type: application/octet-stream, Size: 21657 bytes --]

diff -rNU4 linux-2.6.8.1/security/selinux/avc.c linux-2.6.8.1.rcu/security/selinux/avc.c
--- linux-2.6.8.1/security/selinux/avc.c	2004-08-14 19:55:48.000000000 +0900
+++ linux-2.6.8.1.rcu/security/selinux/avc.c	2004-08-30 20:00:27.000000000 +0900
@@ -3,8 +3,11 @@
  *
  * Authors:  Stephen Smalley, <sds@epoch.ncsc.mil>
  *           James Morris <jmorris@redhat.com>
  *
+ * Update:   KaiGai, Kohei <kaigai@ak.jp.nec.com>
+ *     Replaced the avc_lock spinlock by RCU.
+ *
  * Copyright (C) 2003 Red Hat, Inc., James Morris <jmorris@redhat.com>
  *
  *	This program is free software; you can redistribute it and/or modify
  *	it under the terms of the GNU General Public License version 2,
@@ -35,28 +38,32 @@
 #include "av_perm_to_string.h"
 #include "objsec.h"
 
 #define AVC_CACHE_SLOTS		512
-#define AVC_CACHE_MAXNODES	410
+#define AVC_CACHE_THRESHOLD	410
+#define AVC_CACHE_RECLAIM	16
 
 struct avc_entry {
 	u32			ssid;
 	u32			tsid;
 	u16			tclass;
 	struct av_decision	avd;
-	int			used;	/* used recently */
+	atomic_t		used;	/* used recently */
 };
 
 struct avc_node {
 	struct avc_entry	ae;
-	struct avc_node		*next;
+	struct list_head	list;
+	struct rcu_head         rhead;
 };
 
 struct avc_cache {
-	struct avc_node	*slots[AVC_CACHE_SLOTS];
-	u32		lru_hint;	/* LRU hint for reclaim scan */
-	u32		active_nodes;
-	u32		latest_notif;	/* latest revocation notification */
+	struct list_head	slots[AVC_CACHE_SLOTS];
+	spinlock_t		slots_lock[AVC_CACHE_SLOTS];
+		/* lock for insert/update/delete and reset */
+	atomic_t		lru_hint;	/* LRU hint for reclaim scan */
+	atomic_t		active_nodes;
+	u32			latest_notif;	/* latest revocation notification */
 };
 
 struct avc_callback_node {
 	int (*callback) (u32 event, u32 ssid, u32 tsid,
@@ -69,10 +76,8 @@
 	u32 perms;
 	struct avc_callback_node *next;
 };
 
-static spinlock_t avc_lock = SPIN_LOCK_UNLOCKED;
-static struct avc_node *avc_node_freelist;
 static struct avc_cache avc_cache;
 static unsigned avc_cache_stats[AVC_NSTATS];
 static struct avc_callback_node *avc_callbacks;
 
@@ -187,23 +192,17 @@
  * Initialize the access vector cache.
  */
 void __init avc_init(void)
 {
-	struct avc_node	*new;
 	int i;
 
-	for (i = 0; i < AVC_CACHE_MAXNODES; i++) {
-		new = kmalloc(sizeof(*new), GFP_ATOMIC);
-		if (!new) {
-			printk(KERN_WARNING "avc:  only able to allocate "
-			       "%d entries\n", i);
-			break;
-		}
-		memset(new, 0, sizeof(*new));
-		new->next = avc_node_freelist;
-		avc_node_freelist = new;
-	}
-
+	for (i = 0; i < AVC_CACHE_SLOTS; i++) {
+		INIT_LIST_HEAD(&avc_cache.slots[i]);
+		avc_cache.slots_lock[i] = SPIN_LOCK_UNLOCKED;
+	}
+	atomic_set(&avc_cache.active_nodes, 0);
+	atomic_set(&avc_cache.lru_hint, 0);
+	
 	audit_log(current->audit_context, "AVC INITIALIZED\n");
 }
 
 #if 0
@@ -212,27 +211,24 @@
 	int i, chain_len, max_chain_len, slots_used;
 	struct avc_node *node;
 	unsigned long flags;
 
-	spin_lock_irqsave(&avc_lock,flags);
+	rcu_read_lock();
 
 	slots_used = 0;
 	max_chain_len = 0;
 	for (i = 0; i < AVC_CACHE_SLOTS; i++) {
-		node = avc_cache.slots[i];
-		if (node) {
+		if (!list_empty(&avc_cache.slots[i])) {
 			slots_used++;
 			chain_len = 0;
-			while (node) {
+			list_for_each_entry_rcu(node, &avc_cache.slots[i], list)
 				chain_len++;
-				node = node->next;
-			}
 			if (chain_len > max_chain_len)
 				max_chain_len = chain_len;
 		}
 	}
 
-	spin_unlock_irqrestore(&avc_lock,flags);
+	rcu_read_unlock();
 
 	printk(KERN_INFO "\n");
 	printk(KERN_INFO "%s avc:  %d entries and %d/%d buckets used, longest "
 	       "chain length %d\n", tag, avc_cache.active_nodes, slots_used,
@@ -242,188 +238,210 @@
 static inline void avc_hash_eval(char *tag)
 { }
 #endif
 
-static inline struct avc_node *avc_reclaim_node(void)
+static void avc_node_free(struct rcu_head *rhead) {
+	struct avc_node *node;
+	node = container_of(rhead, struct avc_node, rhead);
+	kfree(node);
+}
+
+static inline int avc_reclaim_node(void)
 {
-	struct avc_node *prev, *cur;
-	int hvalue, try;
+	struct avc_node *node;
+	int hvalue, try, ecx;
 
-	hvalue = avc_cache.lru_hint;
-	for (try = 0; try < 2; try++) {
-		do {
-			prev = NULL;
-			cur = avc_cache.slots[hvalue];
-			while (cur) {
-				if (!cur->ae.used)
-					goto found;
+	for (try = 0, ecx = 0; try < AVC_CACHE_SLOTS; try++ ) {
+		hvalue = atomic_inc_return(&avc_cache.lru_hint) % AVC_CACHE_SLOTS;
 
-				cur->ae.used = 0;
+		if (!spin_trylock(&avc_cache.slots_lock[hvalue]))
+			continue;
 
-				prev = cur;
-				cur = cur->next;
+		list_for_each_entry(node, &avc_cache.slots[hvalue], list) {
+			if (!atomic_dec_and_test(&node->ae.used)) {
+				/* Recently Unused */
+				list_del_rcu(&node->list);
+				call_rcu(&node->rhead, avc_node_free);
+				atomic_dec(&avc_cache.active_nodes);
+				ecx++;
+				if (ecx >= AVC_CACHE_RECLAIM) {
+					spin_unlock(&avc_cache.slots_lock[hvalue]);
+					goto out;
+				}
 			}
-			hvalue = (hvalue + 1) & (AVC_CACHE_SLOTS - 1);
-		} while (hvalue != avc_cache.lru_hint);
+		}
+		spin_unlock(&avc_cache.slots_lock[hvalue]);
 	}
-
-	panic("avc_reclaim_node");
-
-found:
-	avc_cache.lru_hint = hvalue;
-
-	if (prev == NULL)
-		avc_cache.slots[hvalue] = cur->next;
-	else
-		prev->next = cur->next;
-
-	return cur;
+out:
+	return ecx;
 }
 
-static inline struct avc_node *avc_claim_node(u32 ssid,
-                                              u32 tsid, u16 tclass)
+static inline struct avc_node *avc_get_node(void)
 {
 	struct avc_node *new;
-	int hvalue;
-
-	hvalue = avc_hash(ssid, tsid, tclass);
-	if (avc_node_freelist) {
-		new = avc_node_freelist;
-		avc_node_freelist = avc_node_freelist->next;
-		avc_cache.active_nodes++;
-	} else {
-		new = avc_reclaim_node();
-		if (!new)
-			goto out;
-	}
+	int actives;
 
-	new->ae.used = 1;
-	new->ae.ssid = ssid;
-	new->ae.tsid = tsid;
-	new->ae.tclass = tclass;
-	new->next = avc_cache.slots[hvalue];
-	avc_cache.slots[hvalue] = new;
+	new = kmalloc(sizeof(struct avc_node), GFP_ATOMIC);
+	if (!new)
+		return NULL;
+
+	INIT_RCU_HEAD(&new->rhead);
+	INIT_LIST_HEAD(&new->list);
+
+	actives = atomic_inc_return(&avc_cache.active_nodes);
+	if (actives > AVC_CACHE_THRESHOLD)
+		avc_reclaim_node();
 
-out:
 	return new;
 }
 
 static inline struct avc_node *avc_search_node(u32 ssid, u32 tsid,
                                                u16 tclass, int *probes)
 {
-	struct avc_node *cur;
+	struct avc_node *node, *ret = NULL;
 	int hvalue;
 	int tprobes = 1;
 
 	hvalue = avc_hash(ssid, tsid, tclass);
-	cur = avc_cache.slots[hvalue];
-	while (cur != NULL &&
-	       (ssid != cur->ae.ssid ||
-		tclass != cur->ae.tclass ||
-		tsid != cur->ae.tsid)) {
+	list_for_each_entry_rcu(node, &avc_cache.slots[hvalue], list) {
+		if (ssid == node->ae.ssid &&
+		    tclass == node->ae.tclass &&
+		    tsid == node->ae.tsid) {
+			ret = node;
+			break;
+		}
 		tprobes++;
-		cur = cur->next;
 	}
 
-	if (cur == NULL) {
+	if (ret == NULL) {
 		/* cache miss */
 		goto out;
 	}
 
 	/* cache hit */
 	if (probes)
 		*probes = tprobes;
-
-	cur->ae.used = 1;
-
+	if (atomic_read(&ret->ae.used) != 1)
+		atomic_set(&ret->ae.used, 1);
 out:
-	return cur;
+	return ret;
 }
 
 /**
  * avc_lookup - Look up an AVC entry.
  * @ssid: source security identifier
  * @tsid: target security identifier
  * @tclass: target security class
  * @requested: requested permissions, interpreted based on @tclass
- * @aeref:  AVC entry reference
  *
  * Look up an AVC entry that is valid for the
  * @requested permissions between the SID pair
  * (@ssid, @tsid), interpreting the permissions
  * based on @tclass.  If a valid AVC entry exists,
- * then this function updates @aeref to refer to the
- * entry and returns %0. Otherwise, this function
- * returns -%ENOENT.
+ * then this function returns the avc_node.
+ * Otherwise, this function returns NULL.
  */
-int avc_lookup(u32 ssid, u32 tsid, u16 tclass,
-               u32 requested, struct avc_entry_ref *aeref)
+struct avc_node *avc_lookup(u32 ssid, u32 tsid, u16 tclass, u32 requested)
 {
 	struct avc_node *node;
-	int probes, rc = 0;
+	int probes;
 
 	avc_cache_stats_incr(AVC_CAV_LOOKUPS);
 	node = avc_search_node(ssid, tsid, tclass,&probes);
 
 	if (node && ((node->ae.avd.decided & requested) == requested)) {
 		avc_cache_stats_incr(AVC_CAV_HITS);
 		avc_cache_stats_add(AVC_CAV_PROBES,probes);
-		aeref->ae = &node->ae;
 		goto out;
 	}
 
+	node = NULL;
 	avc_cache_stats_incr(AVC_CAV_MISSES);
-	rc = -ENOENT;
 out:
-	return rc;
+	return node;
+}
+
+static int avc_latest_notif_update(int seqno, int is_insert)
+{
+	int ret = 0;
+	static spinlock_t notif_lock = SPIN_LOCK_UNLOCKED;
+	unsigned long flag;
+
+	spin_lock_irqsave(&notif_lock, flag);
+	if (is_insert) {
+		if (seqno < avc_cache.latest_notif) {
+			printk(KERN_WARNING "avc:  seqno %d < latest_notif %d\n",
+			       seqno, avc_cache.latest_notif);
+			ret = -EAGAIN;
+		}
+	} else {
+		if (seqno > avc_cache.latest_notif)
+			avc_cache.latest_notif = seqno;
+	}
+	spin_unlock_irqrestore(&notif_lock, flag);
+
+	return ret;
 }
 
 /**
  * avc_insert - Insert an AVC entry.
  * @ssid: source security identifier
  * @tsid: target security identifier
  * @tclass: target security class
  * @ae: AVC entry
- * @aeref:  AVC entry reference
  *
  * Insert an AVC entry for the SID pair
  * (@ssid, @tsid) and class @tclass.
  * The access vectors and the sequence number are
  * normally provided by the security server in
  * response to a security_compute_av() call.  If the
  * sequence number @ae->avd.seqno is not less than the latest
  * revocation notification, then the function copies
- * the access vectors into a cache entry, updates
- * @aeref to refer to the entry, and returns %0.
- * Otherwise, this function returns -%EAGAIN.
+ * the access vectors into a cache entry and returns
+ * the inserted avc_node. Otherwise, this function returns NULL.
  */
-int avc_insert(u32 ssid, u32 tsid, u16 tclass,
-               struct avc_entry *ae, struct avc_entry_ref *aeref)
+struct avc_node *avc_insert(u32 ssid, u32 tsid, u16 tclass, struct avc_entry *ae)
 {
-	struct avc_node *node;
-	int rc = 0;
+	struct avc_node *pos, *node = NULL;
+	int hvalue;
+	unsigned long flag;
 
-	if (ae->avd.seqno < avc_cache.latest_notif) {
-		printk(KERN_WARNING "avc:  seqno %d < latest_notif %d\n",
-		       ae->avd.seqno, avc_cache.latest_notif);
-		rc = -EAGAIN;
+	if (avc_latest_notif_update(ae->avd.seqno, 1))
 		goto out;
-	}
 
-	node = avc_claim_node(ssid, tsid, tclass);
-	if (!node) {
-		rc = -ENOMEM;
-		goto out;
-	}
+	node = avc_get_node();
+
+	if (node) {
+		hvalue = avc_hash(ssid, tsid, tclass);
 
-	node->ae.avd.allowed = ae->avd.allowed;
-	node->ae.avd.decided = ae->avd.decided;
-	node->ae.avd.auditallow = ae->avd.auditallow;
-	node->ae.avd.auditdeny = ae->avd.auditdeny;
-	node->ae.avd.seqno = ae->avd.seqno;
-	aeref->ae = &node->ae;
+		node->ae.ssid = ssid;
+		node->ae.tsid = tsid;
+		node->ae.tclass = tclass;
+		atomic_set(&node->ae.used, 1);
+
+		node->ae.avd.allowed = ae->avd.allowed;
+		node->ae.avd.decided = ae->avd.decided;
+		node->ae.avd.auditallow = ae->avd.auditallow;
+		node->ae.avd.auditdeny = ae->avd.auditdeny;
+		node->ae.avd.seqno = ae->avd.seqno;
+
+		spin_lock_irqsave(&avc_cache.slots_lock[hvalue], flag);
+		list_for_each_entry(pos, &avc_cache.slots[hvalue], list) {
+			if (pos->ae.ssid == ssid &&
+			    pos->ae.tsid == tsid &&
+			    pos->ae.tclass == tclass) {
+				list_replace_rcu(&pos->list, &node->list);
+				call_rcu(&pos->rhead, avc_node_free);
+				atomic_dec(&avc_cache.active_nodes);
+				goto found;
+			}
+		}
+		list_add_rcu(&node->list, &avc_cache.slots[hvalue]);
+found:
+		spin_unlock_irqrestore(&avc_cache.slots_lock[hvalue], flag);
+	}
 out:
-	return rc;
+	return node;
 }
 
 static inline void avc_print_ipv6_addr(struct audit_buffer *ab,
 				       struct in6_addr *addr, u16 port,
@@ -685,10 +703,51 @@
 {
 	return (x == y || x == SECSID_WILD || y == SECSID_WILD);
 }
 
-static inline void avc_update_node(u32 event, struct avc_node *node, u32 perms)
+/**
+ * avc_update_node - Update an AVC entry
+ * @event: updating event
+ * @perms: permission mask bits
+ * @ssid,@tsid,@tclass: identifiers of the AVC entry
+ *
+ * If a valid AVC entry doesn't exist, this function returns -ENOENT.
+ * If the internal kmalloc() fails, this function returns -ENOMEM.
+ * Otherwise, this function updates the AVC entry; the original entry
+ * object will be released later by RCU.
+ */
+static int avc_update_node(u32 event, u32 perms, u32 ssid, u32 tsid, u16 tclass)
 {
+	int hvalue, rc = 0;
+	unsigned long flag;
+	struct avc_node *pos, *node, *org = NULL;
+
+	/* Lock the target slot */
+	hvalue = avc_hash(ssid, tsid, tclass);
+	spin_lock_irqsave(&avc_cache.slots_lock[hvalue], flag);
+
+	list_for_each_entry(pos, &avc_cache.slots[hvalue], list) {
+		if (ssid == pos->ae.ssid &&
+		    tsid == pos->ae.tsid &&
+		    tclass == pos->ae.tclass) {
+			org = pos;
+			break;
+		}
+	}
+
+	if (!org) {
+		rc = -ENOENT;
+		goto out;
+	}
+
+	node = kmalloc(sizeof(struct avc_node), GFP_ATOMIC);
+	if (!node) {
+		rc = -ENOMEM;
+		goto out;
+	}
+
+	/* Duplicate and Update */
+	memcpy(node, org, sizeof(struct avc_node));
 	switch (event) {
 	case AVC_CALLBACK_GRANT:
 		node->ae.avd.allowed |= perms;
 		break;
@@ -708,40 +767,41 @@
 	case AVC_CALLBACK_AUDITDENY_DISABLE:
 		node->ae.avd.auditdeny &= ~perms;
 		break;
 	}
+	list_replace_rcu(&org->list, &node->list);
+out:
+	spin_unlock_irqrestore(&avc_cache.slots_lock[hvalue], flag);
+
+	return rc;
 }
 
 static int avc_update_cache(u32 event, u32 ssid, u32 tsid,
                             u16 tclass, u32 perms)
 {
 	struct avc_node *node;
 	int i;
-	unsigned long flags;
 
-	spin_lock_irqsave(&avc_lock,flags);
+	rcu_read_lock();
 
 	if (ssid == SECSID_WILD || tsid == SECSID_WILD) {
 		/* apply to all matching nodes */
 		for (i = 0; i < AVC_CACHE_SLOTS; i++) {
-			for (node = avc_cache.slots[i]; node;
-			     node = node->next) {
+			list_for_each_entry_rcu(node, &avc_cache.slots[i], list) {
 				if (avc_sidcmp(ssid, node->ae.ssid) &&
 				    avc_sidcmp(tsid, node->ae.tsid) &&
-				    tclass == node->ae.tclass) {
-					avc_update_node(event,node,perms);
+				    tclass == node->ae.tclass) {
+					avc_update_node(event, perms, node->ae.ssid,
+					                node->ae.tsid, node->ae.tclass);
 				}
 			}
 		}
 	} else {
 		/* apply to one node */
-		node = avc_search_node(ssid, tsid, tclass, NULL);
-		if (node) {
-			avc_update_node(event,node,perms);
-		}
+		avc_update_node(event, perms, ssid, tsid, tclass);
 	}
 
-	spin_unlock_irqrestore(&avc_lock,flags);
+	rcu_read_unlock();
 
 	return 0;
 }
 
@@ -751,9 +811,8 @@
 {
 	struct avc_callback_node *c;
 	u32 tretained = 0, cretained = 0;
 	int rc = 0;
-	unsigned long flags;
 
 	/*
 	 * try_revoke only removes permissions from the cache
 	 * state if they are not retained by the object manager.
@@ -786,12 +845,9 @@
 		avc_update_cache(event,ssid,tsid,tclass,perms);
 		*out_retained = tretained;
 	}
 
-	spin_lock_irqsave(&avc_lock,flags);
-	if (seqno > avc_cache.latest_notif)
-		avc_cache.latest_notif = seqno;
-	spin_unlock_irqrestore(&avc_lock,flags);
+	avc_latest_notif_update(seqno, 0);
 
 out:
 	return rc;
 }
@@ -856,34 +912,21 @@
 int avc_ss_reset(u32 seqno)
 {
 	struct avc_callback_node *c;
 	int i, rc = 0;
-	struct avc_node *node, *tmp;
-	unsigned long flags;
+	unsigned long flag;
+	struct avc_node *node;
 
 	avc_hash_eval("reset");
 
-	spin_lock_irqsave(&avc_lock,flags);
-
 	for (i = 0; i < AVC_CACHE_SLOTS; i++) {
-		node = avc_cache.slots[i];
-		while (node) {
-			tmp = node;
-			node = node->next;
-			tmp->ae.ssid = tmp->ae.tsid = SECSID_NULL;
-			tmp->ae.tclass = SECCLASS_NULL;
-			tmp->ae.avd.allowed = tmp->ae.avd.decided = 0;
-			tmp->ae.avd.auditallow = tmp->ae.avd.auditdeny = 0;
-			tmp->ae.used = 0;
-			tmp->next = avc_node_freelist;
-			avc_node_freelist = tmp;
-			avc_cache.active_nodes--;
+		spin_lock_irqsave(&avc_cache.slots_lock[i], flag);
+		list_for_each_entry(node, &avc_cache.slots[i], list) {
+			list_del_rcu(&node->list);
+			call_rcu(&node->rhead, avc_node_free);
 		}
-		avc_cache.slots[i] = NULL;
+		spin_unlock_irqrestore(&avc_cache.slots_lock[i], flag);
 	}
-	avc_cache.lru_hint = 0;
-
-	spin_unlock_irqrestore(&avc_lock,flags);
 
 	for (i = 0; i < AVC_NSTATS; i++)
 		avc_cache_stats[i] = 0;
 
@@ -895,12 +938,9 @@
 				goto out;
 		}
 	}
 
-	spin_lock_irqsave(&avc_lock,flags);
-	if (seqno > avc_cache.latest_notif)
-		avc_cache.latest_notif = seqno;
-	spin_unlock_irqrestore(&avc_lock,flags);
+	avc_latest_notif_update(seqno, 0);
 out:
 	return rc;
 }
 
@@ -949,9 +989,9 @@
  * @ssid: source security identifier
  * @tsid: target security identifier
  * @tclass: target security class
  * @requested: requested permissions, interpreted based on @tclass
- * @aeref:  AVC entry reference
+ * @aeref:  AVC entry reference (unused)
  * @avd: access vector decisions
  *
  * Check the AVC to determine whether the @requested permissions are granted
  * for the SID pair (@ssid, @tsid), interpreting the permissions
@@ -968,72 +1008,44 @@
 int avc_has_perm_noaudit(u32 ssid, u32 tsid,
                          u16 tclass, u32 requested,
                          struct avc_entry_ref *aeref, struct av_decision *avd)
 {
-	struct avc_entry *ae;
+	struct avc_node *node;
+	struct avc_entry entry, *p_ae;
 	int rc = 0;
-	unsigned long flags;
-	struct avc_entry entry;
 	u32 denied;
-	struct avc_entry_ref ref;
 
-	if (!aeref) {
-		avc_entry_ref_init(&ref);
-		aeref = &ref;
-	}
-
-	spin_lock_irqsave(&avc_lock, flags);
+	rcu_read_lock();
 	avc_cache_stats_incr(AVC_ENTRY_LOOKUPS);
-	ae = aeref->ae;
-	if (ae) {
-		if (ae->ssid == ssid &&
-		    ae->tsid == tsid &&
-		    ae->tclass == tclass &&
-		    ((ae->avd.decided & requested) == requested)) {
-			avc_cache_stats_incr(AVC_ENTRY_HITS);
-			ae->used = 1;
-		} else {
-			avc_cache_stats_incr(AVC_ENTRY_DISCARDS);
-			ae = NULL;
-		}
-	}
 
-	if (!ae) {
-		avc_cache_stats_incr(AVC_ENTRY_MISSES);
-		rc = avc_lookup(ssid, tsid, tclass, requested, aeref);
-		if (rc) {
-			spin_unlock_irqrestore(&avc_lock,flags);
-			rc = security_compute_av(ssid,tsid,tclass,requested,&entry.avd);
-			if (rc)
-				goto out;
-			spin_lock_irqsave(&avc_lock, flags);
-			rc = avc_insert(ssid,tsid,tclass,&entry,aeref);
-			if (rc) {
-				spin_unlock_irqrestore(&avc_lock,flags);
-				goto out;
-			}
-		}
-		ae = aeref->ae;
+	node = avc_lookup(ssid, tsid, tclass, requested);
+	if (!node) {
+		rcu_read_unlock();
+		rc = security_compute_av(ssid,tsid,tclass,requested,&entry.avd);
+		if (rc)
+			goto out;
+		rcu_read_lock();
+		node = avc_insert(ssid,tsid,tclass,&entry);
 	}
 
+	p_ae = node ? &node->ae : &entry;
+
 	if (avd)
-		memcpy(avd, &ae->avd, sizeof(*avd));
+		memcpy(avd, &p_ae->avd, sizeof(*avd));
 
-	denied = requested & ~(ae->avd.allowed);
+	denied = requested & ~(p_ae->avd.allowed);
 
 	if (!requested || denied) {
 		if (selinux_enforcing) {
-			spin_unlock_irqrestore(&avc_lock,flags);
 			rc = -EACCES;
-			goto out;
 		} else {
-			ae->avd.allowed |= requested;
-			spin_unlock_irqrestore(&avc_lock,flags);
-			goto out;
+			if (node)
+				avc_update_node(AVC_CALLBACK_GRANT, requested,
+				                ssid, tsid, tclass);
 		}
 	}
 
-	spin_unlock_irqrestore(&avc_lock,flags);
+	rcu_read_unlock();
 out:
 	return rc;
 }
 
@@ -1042,9 +1054,9 @@
  * @ssid: source security identifier
  * @tsid: target security identifier
  * @tclass: target security class
  * @requested: requested permissions, interpreted based on @tclass
- * @aeref:  AVC entry reference
+ * @aeref:  AVC entry reference (unused)
  * @auditdata: auxiliary audit data
  *
  * Check the AVC to determine whether the @requested permissions are granted
  * for the SID pair (@ssid, @tsid), interpreting the permissions
@@ -1061,8 +1073,8 @@
 {
 	struct av_decision avd;
 	int rc;
 
-	rc = avc_has_perm_noaudit(ssid, tsid, tclass, requested, aeref, &avd);
+	rc = avc_has_perm_noaudit(ssid, tsid, tclass, requested, NULL, &avd);
 	avc_audit(ssid, tsid, tclass, requested, &avd, rc, auditdata);
 	return rc;
 }
diff -rNU4 linux-2.6.8.1/security/selinux/include/avc.h linux-2.6.8.1.rcu/security/selinux/include/avc.h
--- linux-2.6.8.1/security/selinux/include/avc.h	2004-08-14 19:54:51.000000000 +0900
+++ linux-2.6.8.1.rcu/security/selinux/include/avc.h	2004-08-20 20:40:50.000000000 +0900
@@ -118,13 +118,11 @@
  */
 
 void __init avc_init(void);
 
-int avc_lookup(u32 ssid, u32 tsid, u16 tclass,
-               u32 requested, struct avc_entry_ref *aeref);
+struct avc_node *avc_lookup(u32 ssid, u32 tsid, u16 tclass, u32 requested);
 
-int avc_insert(u32 ssid, u32 tsid, u16 tclass,
-               struct avc_entry *ae, struct avc_entry_ref *out_aeref);
+struct avc_node *avc_insert(u32 ssid, u32 tsid, u16 tclass, struct avc_entry *ae);
 
 void avc_audit(u32 ssid, u32 tsid,
                u16 tclass, u32 requested,
                struct av_decision *avd, int result, struct avc_audit_data *auditdata);

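To summarize the interface change above: callers no longer pass an
avc_entry_ref; instead they hold rcu_read_lock() across the lookup and
across every dereference of the returned node. A minimal sketch of the
new read-side calling convention (illustrative only; example_check() is
a hypothetical caller, not part of the patch):

[ read-side sketch ]---------------------------
static int example_check(u32 ssid, u32 tsid, u16 tclass, u32 requested)
{
	struct avc_node *node;
	u32 denied = requested;	/* treat a cache miss as a denial here */

	rcu_read_lock();
	node = avc_lookup(ssid, tsid, tclass, requested);
	if (node)
		denied = requested & ~node->ae.avd.allowed;
	rcu_read_unlock();	/* node must not be touched after this */

	return denied ? -EACCES : 0;
}
-----------------------------------------------
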
[-- Attachment #3: atomic_inc_return-2.6.8.1.patch --]
[-- Type: application/octet-stream, Size: 2739 bytes --]

--- linux-2.6.8.1/include/asm-arm/atomic.h	2004-08-14 19:54:50.000000000 +0900
+++ linux-2.6.8.1.rcu/include/asm-arm/atomic.h	2004-08-24 19:31:56.000000000 +0900
@@ -194,8 +194,10 @@
 #define atomic_dec(v)		atomic_sub(1, v)
 
 #define atomic_inc_and_test(v)	(atomic_add_return(1, v) == 0)
 #define atomic_dec_and_test(v)	(atomic_sub_return(1, v) == 0)
+#define atomic_inc_return(v)    (atomic_add_return(1, v))
+#define atomic_dec_return(v)    (atomic_sub_return(1, v))
 
 #define atomic_add_negative(i,v) (atomic_add_return(i, v) < 0)
 
 /* Atomic operations are already serializing on ARM */
--- linux-2.6.8.1/include/asm-i386/atomic.h	2004-08-14 19:55:09.000000000 +0900
+++ linux-2.6.8.1.rcu/include/asm-i386/atomic.h	2004-08-25 11:57:08.000000000 +0900
@@ -175,8 +175,34 @@
 		:"ir" (i), "m" (v->counter) : "memory");
 	return c;
 }
 
+/**
+ * atomic_add_return - add and return
+ * @v: pointer of type atomic_t
+ * @i: integer value to add
+ *
+ * Atomically adds @i to @v and returns @i + @v
+ */
+static __inline__ int atomic_add_return(int i, atomic_t *v)
+{
+	/* NOTICE: This function can be used on i486+ */
+	int __i = i;
+	__asm__ __volatile__(
+		LOCK "xaddl %0, %1;"
+		:"=r"(i)
+		:"m"(v->counter), "0"(i));
+	return i + __i;
+}
+
+static __inline__ int atomic_sub_return(int i, atomic_t *v)
+{
+	return atomic_add_return(-i,v);
+}
+
+#define atomic_inc_return(v)  (atomic_add_return(1,v))
+#define atomic_dec_return(v)  (atomic_sub_return(1,v))
+
 /* These are x86-specific, used by some header files */
 #define atomic_clear_mask(mask, addr) \
 __asm__ __volatile__(LOCK "andl %0,%1" \
 : : "r" (~(mask)),"m" (*addr) : "memory")
--- linux-2.6.8.1/include/asm-x86_64/atomic.h	2004-08-14 19:56:23.000000000 +0900
+++ linux-2.6.8.1.rcu/include/asm-x86_64/atomic.h	2004-08-25 11:57:36.000000000 +0900
@@ -177,8 +177,33 @@
 		:"ir" (i), "m" (v->counter) : "memory");
 	return c;
 }
 
+/**
+ * atomic_add_return - add and return
+ * @i: integer value to add
+ * @v: pointer of type atomic_t
+ *
+ * Atomically adds @i to @v and returns @i + @v
+ */
+static __inline__ int atomic_add_return(int i, atomic_t *v)
+{
+	int __i = i;
+	__asm__ __volatile__(
+		LOCK "xaddl %0, %1;"
+		:"=r"(i)
+		:"m"(v->counter), "0"(i));
+	return i + __i;
+}
+
+static __inline__ int atomic_sub_return(int i, atomic_t *v)
+{
+	return atomic_add_return(-i,v);
+}
+
+#define atomic_inc_return(v)  (atomic_add_return(1,v))
+#define atomic_dec_return(v)  (atomic_sub_return(1,v))
+
 /* These are x86-specific, used by some header files */
 #define atomic_clear_mask(mask, addr) \
 __asm__ __volatile__(LOCK "andl %0,%1" \
 : : "r" (~(mask)),"m" (*addr) : "memory")

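The point of atomic_inc_return() over separate atomic_inc() and
atomic_read() calls is that the increment and the readback form one
atomic read-modify-write, so no two CPUs can both observe a stale,
pre-increment count. A minimal sketch of the pattern avc_get_node()
uses it for (count_and_check() is a hypothetical wrapper):

[ usage sketch ]-------------------------------
static atomic_t active_nodes = ATOMIC_INIT(0);

/* Returns nonzero when the caller should reclaim an entry. */
static int count_and_check(int threshold)
{
	return atomic_inc_return(&active_nodes) > threshold;
}
-----------------------------------------------
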
[-- Attachment #4: list_replace_rcu-2.6.8.1.patch --]
[-- Type: application/octet-stream, Size: 845 bytes --]

--- linux-2.6.8.1/include/linux/list.h	2004-08-14 19:55:33.000000000 +0900
+++ linux-2.6.8.1.rcu/include/linux/list.h	2004-08-20 18:04:10.000000000 +0900
@@ -194,8 +194,23 @@
 	__list_del(entry->prev, entry->next);
 	entry->prev = LIST_POISON2;
 }
 
+/**
+ * list_replace_rcu - replace an old entry with a new one in an RCU list
+ * @old: the element to be replaced in the list.
+ * @new: the new element to insert into the list.
+ *
+ * The old entry will be replaced by the new entry atomically.
+ */
+static inline void list_replace_rcu(struct list_head *old, struct list_head *new) {
+	new->next = old->next;
+	new->prev = old->prev;
+	smp_wmb();
+	new->next->prev = new;
+	new->prev->next = new;
+}
+
 /**
  * list_del_init - deletes entry from list and reinitialize it.
  * @entry: the element to delete from the list.
  */

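A minimal usage sketch of the new primitive, following the same pattern
as avc_insert() in the main patch (struct item and item_free() are
illustrative, not part of the patch; the writer is assumed to hold
whatever lock serializes updates to the list):

[ usage sketch ]-------------------------------
struct item {
	struct list_head list;
	struct rcu_head rhead;
	int value;
};

static void item_free(struct rcu_head *rhead)
{
	kfree(container_of(rhead, struct item, rhead));
}

/*
 * Replace @old with @new, then let RCU reclaim @old once all current
 * readers have finished.  Readers walking the list concurrently with
 * list_for_each_entry_rcu() see either the old or the new entry,
 * never a half-updated list.
 */
static void replace_item(struct item *old, struct item *new)
{
	list_replace_rcu(&old->list, &new->list);
	call_rcu(&old->rhead, item_free);
}
-----------------------------------------------
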
^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH]SELinux performance improvement by RCU (Re: RCU issue with SELinux)
  2004-08-30 11:17                   ` [PATCH]SELinux performance improvement by RCU (Re: RCU issue with SELinux) Kaigai Kohei
@ 2004-08-30 15:35                     ` Stephen Smalley
  2004-08-30 16:13                       ` Paul E. McKenney
  2004-08-31 15:33                     ` James Morris
  1 sibling, 1 reply; 31+ messages in thread
From: Stephen Smalley @ 2004-08-30 15:35 UTC (permalink / raw)
  To: Kaigai Kohei
  Cc: SELinux-ML(Eng), Linux Kernel ML(Eng), James Morris, Paul E. McKenney

On Mon, 2004-08-30 at 07:17, Kaigai Kohei wrote:
> I fixed the take-3 patch according to your and Paul's suggestions.
> 
> The attached take-4 patches replace the avc_lock in security/selinux/avc.c
> by the lock-less read access with RCU.

Thanks.  Was there a reason you didn't move the rcu_read_lock call after
the avc_insert call per the suggestion of Paul McKenney, or was that
just an oversight?  No need to send a new patch, just ack whether or not
you meant to switch the order there.

-- 
Stephen Smalley <sds@epoch.ncsc.mil>
National Security Agency


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH]SELinux performance improvement by RCU (Re: RCU issue with SELinux)
  2004-08-30 15:35                     ` Stephen Smalley
@ 2004-08-30 16:13                       ` Paul E. McKenney
  2004-08-31  4:33                         ` Kaigai Kohei
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2004-08-30 16:13 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Kaigai Kohei, SELinux-ML(Eng), Linux Kernel ML(Eng), James Morris

On Mon, Aug 30, 2004 at 11:35:19AM -0400, Stephen Smalley wrote:
> On Mon, 2004-08-30 at 07:17, Kaigai Kohei wrote:
> > I fixed the take-3 patch according to your and Paul's suggestions.
> > 
> > The attached take-4 patches replace the avc_lock in security/selinux/avc.c
> > by the lock-less read access with RCU.
> 
> Thanks.  Was there a reason you didn't move the rcu_read_lock call after
> the avc_insert call per the suggestion of Paul McKenney, or was that
> just an oversight?  No need to send a new patch, just ack whether or not
> you meant to switch the order there.

One reason might be because I called it out in the text of my message,
but failed to put it in my patch.  :-/  Of course, if there is some reason
why moving the rcu_read_lock() call is bad, I would like to know for
my own education.

						Thanx, Paul

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH]SELinux performance improvement by RCU (Re: RCU issue with SELinux)
  2004-08-30 16:13                       ` Paul E. McKenney
@ 2004-08-31  4:33                         ` Kaigai Kohei
  2004-08-31 16:20                           ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Kaigai Kohei @ 2004-08-31  4:33 UTC (permalink / raw)
  To: paulmck, Stephen Smalley
  Cc: SELinux-ML(Eng), Linux Kernel ML(Eng), James Morris

Hi Stephen, Paul, thanks for your comments.

> > > The attached take-4 patches replace the avc_lock in security/selinux/avc.c
> > > by the lock-less read access with RCU.
> > 
> > Thanks.  Was there a reason you didn't move the rcu_read_lock call after
> > the avc_insert call per the suggestion of Paul McKenney, or was that
> > just an oversight?  No need to send a new patch, just ack whether or not
> > you meant to switch the order there.
> 
> One reason might be because I called it out in the text of my message,
> but failed to put it in my patch.  :-/  Of course, if there is some reason
> why moving the rcu_read_lock() call is bad, I would like to know for
> my own education.

In my understanding, the issue is Paul's suggestion, as follows:

> So I do not believe that avc_insert() needs rcu_read_lock().
> Unless I am missing something, the rcu_read_lock() acquired
> in avc_has_perm_noaudit() should be moved after the call to
> avc_insert().

I didn't move the rcu_read_lock() because of the possibility of preemption
between the spin_unlock_irqrestore() in avc_insert() and a rcu_read_lock()
placed after avc_insert() in avc_has_perm_noaudit().

When returning from avc_insert(), we can't ignore the possibility that
execution is preempted at exactly that point.
Therefore, I didn't move rcu_read_lock(), in spite of the redundancy.

If rcu_read_lock() were moved after avc_insert():
[ in avc_insert() ]----------------------------
                :
        spin_lock_irqsave(&avc_cache.slots_lock[hvalue], flag);
        list_for_each_entry(pos, &avc_cache.slots[hvalue], list) {
                :
        }
        list_add_rcu(&node->list, &avc_cache.slots[hvalue]);
found:
        spin_unlock_irqrestore(&avc_cache.slots_lock[hvalue], flag);  ---------
        //  +--> including preempt_enable()                               |
                It has the danger of releasing the 'node'.                V
    }                                                                preemption
out:                                                                     is
    return node;                                                       possible
}
-----------------------------------------------
Because it's legal to hold the rcu_read_lock() twice, as Paul says,
we should do so for safety.
That is why I didn't move rcu_read_lock() at this point;
my explanation may have been lacking, sorry.
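
To make the safe ordering concrete, here is a simplified skeleton of
avc_has_perm_noaudit() as patched (error handling elided):

[ in avc_has_perm_noaudit() ]------------------
	rcu_read_lock();
	node = avc_lookup(ssid, tsid, tclass, requested);
	if (!node) {
		rcu_read_unlock();
		rc = security_compute_av(ssid, tsid, tclass, requested, &entry.avd);
		rcu_read_lock();	/* taken BEFORE avc_insert() */
		node = avc_insert(ssid, tsid, tclass, &entry);
	}
	/*
	 * node is safe to use here: the read-side critical section was
	 * already open when avc_insert() returned, so call_rcu() cannot
	 * have freed it, even across a preemption.
	 */
	rcu_read_unlock();
-----------------------------------------------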

Thanks.
--------
Kai Gai <kaigai@ak.jp.nec.com>


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH]SELinux performance improvement by RCU (Re: RCU issue with SELinux)
  2004-08-30 11:17                   ` [PATCH]SELinux performance improvement by RCU (Re: RCU issue with SELinux) Kaigai Kohei
  2004-08-30 15:35                     ` Stephen Smalley
@ 2004-08-31 15:33                     ` James Morris
  1 sibling, 0 replies; 31+ messages in thread
From: James Morris @ 2004-08-31 15:33 UTC (permalink / raw)
  To: Kaigai Kohei
  Cc: Stephen Smalley, SELinux-ML(Eng), Linux Kernel ML(Eng), Rik van Riel

I've performed some benchmarking and profiling of these patches, comparing 
the vanilla kernel with SELinux disabled, the same with SELinux enabled, 
and then SELinux enabled with the RCU patches.

The benchmarks I used are apachebench, lmbench, dbench and webstone.

It seems that the patches are providing good scalability from around four
CPUs and up, although baseline performance has not been significantly 
improved.

This benchmarking also confirms some suspected network-specific
performance issues, which need addressing separately; e.g., port and node
caches need to be implemented.

Results + before/after profiling below.


- James
-- 
James Morris
<jmorris@redhat.com>



test system : 8 way 900 MHz Xeon, 4G RAM, 3 x gigabit interfaces.

orig     = vanilla 2.6.8.1 kernel
nec1     = 2.6.8.1 kernel with nec patches
disabled = SELinux is disabled
enabled  = SELinux is enabled (strict policy)

----------------------------------------------------------------------------
apachebench
----------------------------------------------------------------------------

Configuration: 5k file 100k times via localhost.

Transfer rate in MB/s:

# Clients       1       2       4       8
-------------------------------------------------
orig disabled : 12.269  22.339  23.610  20.867
orig enabled  :  9.319  16.594  17.532  16.322
nec1 enabled  :  9.783  17.573  18.373  17.006


----------------------------------------------------------------------------
dbench
----------------------------------------------------------------------------

Configuration: standard client.txt, ext2, average of 10 runs each.

Througput in MB/s:

# Clients       1      2      3      4      5      6      7      8
----------------------------------------------------------------------
orig disabled : 156.8  279.2  383.7  470.0  547.9  610.5  651.7  713.7
orig enabled  : 146.2  260.0  359.1  418.8  450.3  503.6  489.5  446.2
nec1 enabled  : 146.1  261.7  356.9  451.1  515.3  596.5  629.4  684.2


----------------------------------------------------------------------------
Webstone 2.5
----------------------------------------------------------------------------

Configuration:
- two physical clients, each with private gigabit to server.
- standard fileset
- two mins per run, one iteration

Server throughput in Mb/s:

# Clients        2      4       8       16      32
---------------------------------------------------------
orig disabled  : 66.55  302.40  380.16  421.40  434.08
orig enabled   : 64.96  276.34  358.87  388.73  389.43
nec1 enabled   : 64.33  282.50  367.29  403.66  409.65
---------------------------------------------------------


----------------------------------------------------------------------------
lmbench 3.0
----------------------------------------------------------------------------

Basic system parameters
------------------------------------------------------------------------------
Host                 OS Description              Mhz  tlb  cache  mem   scal
                                                     pages line   par   load
                                                           bytes  
--------- ------------- ----------------------- ---- ----- ----- ------ ----
xeon4.lab orig--disable       i686-pc-linux-gnu  898          32           1
xeon4.lab orig--enable        i686-pc-linux-gnu  898          32           1
xeon4.lab nec1--enable        i686-pc-linux-gnu  898          32           1

Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host                 OS  Mhz null null      open slct sig  sig  fork exec sh  
                             call  I/O stat clos TCP  inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
xeon4.lab orig--disable  898 0.24 0.40 5.13 5.90 25.2 0.99 3.98 185. 1006 2853
xeon4.lab orig--enable   898 0.24 0.81 8.62 9.91 23.1 0.99 3.97 189. 1038 3027
xeon4.lab nec1--enable   898 0.24 0.75 8.54 9.65 23.1 0.99 4.02 187. 987. 3073

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
xeon4.lab orig--disable   11.3   21.5   20.1   17.3   14.0    11.4    10.8
xeon4.lab orig--enable    21.0   28.2   26.4   17.5   21.3 8.92000    17.4
xeon4.lab nec1--enable    11.4   27.2 2.5600   13.1   24.0 7.62000    11.8

*Local* Communication latencies in microseconds - smaller is better
---------------------------------------------------------------------
Host                 OS 2p/0K  Pipe AF     UDP  RPC/   TCP  RPC/ TCP
                        ctxsw       UNIX         UDP         TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
xeon4.lab orig--disable  11.3  26.2 25.6              52.2        95.
xeon4.lab orig--enable   21.0  26.5 45.1              97.2       148.
xeon4.lab nec1--enable   11.4  27.6 45.8              71.5       142.

File & VM system latencies in microseconds - smaller is better
-------------------------------------------------------------------------------
Host                 OS   0K File      10K File     Mmap    Prot   Page   100fd
                        Create Delete Create Delete Latency Fault  Fault  selct
--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
xeon4.lab orig--disable   31.5   23.5   86.3   48.1  1778.0 0.228 2.03950  16.9
xeon4.lab orig--enable    55.7   36.0  111.7   57.5  1668.0 0.288 1.99780  15.8
xeon4.lab nec1--enable    55.3   35.0  112.8   56.3  1715.0 0.160 2.03220  15.6

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host                OS  Pipe AF    TCP  File   Mmap  Bcopy  Bcopy  Mem   Mem
                             UNIX      reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
xeon4.lab orig--disable 519. 163. 181.  226.9  256.9  134.3  128.1 257. 199.7
xeon4.lab orig--enable  513. 163. 122.  226.4  256.9  134.3  129.5 256. 197.5
xeon4.lab nec1--enable  504. 294. 121.  226.7  257.2  134.1  129.7 257. 200.4


----------------------------------------------------------------------------
Profiling
----------------------------------------------------------------------------

apachebench (8 clients, 100k iter, 5k file, localhost):

orig-enable:

253368 total                                      0.1025
187668 default_idle                             2932.3125
  6024 avc_has_perm_noaudit                      12.5500
  4124 security_port_sid                         25.7750
  3784 finish_task_switch                        29.5625
  2899 .text.lock.avc                            14.7157
  1717 tcp_v4_rcv                                 0.7107
  1462 __wake_up                                 18.2750
  1032 avc_audit                                  0.3395
  1023 ip_queue_xmit                              0.6208
   984 selinux_socket_sock_rcv_skb                1.3667
   927 tcp_transmit_skb                           0.5315
   788 __kfree_skb                                2.3452
   750 __copy_user_intel                          4.2614
   680 security_node_sid                          2.6562
   657 netif_rx                                   1.4159
   653 tcp_recvmsg                                0.3549
   644 alloc_skb                                  2.5156
   582 ip_finish_output2                          1.1734
   563 __d_lookup                                 1.8520
   541 tcp_rcv_established                        0.2348
   539 tcp_poll                                   1.2957
   524 sel_netif_lookup                           2.0469
   523 avc_has_perm                               4.2520
   517 __mod_timer                                1.5387
   513 atomic_dec_and_lock                        7.7727
   506 sysenter_past_esp                          4.4779
   500 skb_release_data                           3.4722
   493 dev_queue_xmit                             0.6556
   490 kmem_cache_alloc                           6.1250
   482 lock_sock                                  6.0250
   474 try_to_wake_up                             0.6733
   461 __kmalloc                                  3.2014
   425 fget                                       6.6406
   413 tcp_clean_rtx_queue                        0.4232
   413 __copy_to_user_ll                          3.6875
   408 sel_netif_find                             6.3750
   405 .text.lock.sock                            2.6299
   403 skb_copy_bits                              0.6458
   398 kmem_cache_free                            4.1458
  
nec1-enable:

232223 total                                      0.0940
177690 default_idle                             2776.4062
  3729 finish_task_switch                        29.1328
  3656 security_port_sid                         22.8500
  1545 tcp_v4_rcv                                 0.6395
  1422 __wake_up                                 17.7750
   979 selinux_socket_sock_rcv_skb                1.3597
   924 ip_queue_xmit                              0.5607
   903 tcp_transmit_skb                           0.5178
   863 avc_lookup                                 4.4948
   752 __copy_user_intel                          4.2727
   722 memcpy                                    11.2812
   720 __kfree_skb                                2.1429
   714 avc_has_perm_noaudit                       2.2313
   676 tcp_recvmsg                                0.3674
   632 netif_rx                                   1.3621
   598 security_node_sid                          2.3359
   595 alloc_skb                                  2.3242
   547 __mod_timer                                1.6280
   507 __d_lookup                                 1.6678
   489 sysenter_past_esp                          4.3274
   485 skb_release_data                           3.3681
   484 tcp_rcv_established                        0.2101
   480 tcp_poll                                   1.1538
   474 ip_finish_output2                          0.9556
   468 try_to_wake_up                             0.6648
   467 atomic_dec_and_lock                        7.0758
   431 avc_has_perm                               3.5328
   420 tcp_create_openreq_child                   0.3125
   415 lock_sock                                  5.1875
   414 __copy_to_user_ll                          3.6964
   410 skb_copy_bits                              0.6571
   410 sel_netif_lookup                           1.6016
   407 fget                                       6.3594
   398 .text.lock.sock                            2.5844
   397 tcp_clean_rtx_queue                        0.4068
   397 dev_queue_xmit                             0.5279
   392 do_gettimeofday                            1.8846
   362 sel_netif_find                             5.6562
   362 selinux_ip_postroute_last                  0.5518


dbench (8 clients, 10 iter):

orig-enable:

432070 total                                      0.1749
125306 .text.lock.avc                           636.0711
 48451 __copy_user_intel                        275.2898
 34015 __copy_user_zeroing_intel                193.2670
 30862 avc_has_perm_noaudit                      64.2958
 27203 __copy_to_user_ll                        242.8839
 26618 __copy_from_user_ll                      237.6607
 21745 avc_audit                                  7.1530
 18133 default_idle                             283.3281
  6536 __d_lookup                                21.5000
  4034 link_path_walk                             1.0962
  3782 find_get_page                             47.2750
  3280 atomic_dec_and_lock                       49.6970
  2982 do_generic_mapping_read                    2.8673
  2931 kmap_atomic                               20.3542
  2504 __block_prepare_write                      2.3712
  2483 path_lookup                                6.7473
  1860 inode_has_perm                            12.9167
  1847 buffered_rmqueue                           3.3952
  1719 generic_file_aio_write_nolock              0.5165
  1713 file_move                                 26.7656
  1441 avc_has_perm                              11.7154
  1258 current_kernel_time                       15.7250
  1225 sysenter_past_esp                         10.8407
  1198 generic_commit_write                       8.3194
  1165 __find_get_block                           6.0677
  1123 put_page                                   7.7986
  1097 wake_up_page                              13.7125
  1092 page_address                               6.2045
  1074 find_lock_page                             6.1023
  1039 selinux_file_permission                    2.5975
  1030 __block_commit_write                       4.9519
   952 kmem_cache_free                            9.9167
   931 kmem_cache_alloc                          11.6375
   916 do_page_cache_readahead                    2.2019
   899 in_group_p                                 7.0234
   855 block_invalidatepage                       3.5625
   845 file_read_actor                            3.5208
   800 selinux_inode_permission                   3.8462
   782 find_get_pages                             6.9821

nec1-enable:

275604 total                                      0.1115
 55272 __copy_user_intel                        314.0455
 36716 __copy_user_zeroing_intel                208.6136
 29933 __copy_from_user_ll                      267.2589
 29544 __copy_to_user_ll                        263.7857
 25789 default_idle                             402.9531
  4419 __d_lookup                                14.5362
  3636 find_get_page                             45.4500
  2800 do_generic_mapping_read                    2.6923
  2788 avc_has_perm_noaudit                       8.7125
  2768 link_path_walk                             0.7522
  2766 __block_prepare_write                      2.6193
  2706 path_lookup                                7.3533
  2549 kmap_atomic                               17.7014
  2282 atomic_dec_and_lock                       34.5758
  1935 buffered_rmqueue                           3.5570
  1745 generic_file_aio_write_nolock              0.5243
  1614 inode_has_perm                            11.2083
  1344 generic_commit_write                       9.3333
  1238 current_kernel_time                       15.4750
  1190 sysenter_past_esp                         10.5310
  1184 file_move                                 18.5000
  1172 __find_get_block                           6.1042
  1141 wake_up_page                              14.2625
  1098 avc_lookup                                 5.7188
  1083 __block_commit_write                       5.2067
  1033 put_page                                   7.1736
  1027 in_group_p                                 8.0234
  1018 find_lock_page                             5.7841
   992 page_address                               5.6364
   990 selinux_file_permission                    2.4750
   930 memcpy                                    14.5312
   888 block_invalidatepage                       3.7000
   855 file_read_actor                            3.5625
   846 kmem_cache_free                            8.8125
   828 free_hot_cold_page                         2.8750
   809 kmem_cache_alloc                          10.1125
   798 find_get_pages                             7.1250
   793 ext2_free_branches                         1.5988
   774 avc_has_perm                               6.3443
  



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH]SELinux performance improvement by RCU (Re: RCU issue with SELinux)
  2004-08-31  4:33                         ` Kaigai Kohei
@ 2004-08-31 16:20                           ` Paul E. McKenney
  0 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2004-08-31 16:20 UTC (permalink / raw)
  To: Kaigai Kohei
  Cc: Stephen Smalley, SELinux-ML(Eng), Linux Kernel ML(Eng), James Morris

On Tue, Aug 31, 2004 at 01:33:33PM +0900, Kaigai Kohei wrote:
> Hi Stephen, Paul, thanks for your comments.
> 
> > > > The attached take-4 patches replace the avc_lock in security/selinux/avc.c
> > > > by the lock-less read access with RCU.
> > > 
> > > Thanks.  Was there a reason you didn't move the rcu_read_lock call after
> > > the avc_insert call per the suggestion of Paul McKenney, or was that
> > > just an oversight?  No need to send a new patch, just ack whether or not
> > > you meant to switch the order there.
> > 
> > One reason might be because I called it out in the text of my message,
> > but failed to put it in my patch.  :-/  Of course, if there is some reason
> > why moving the rcu_read_lock() call is bad, I would like to know for
> > my own education.
> 
> In my understanding, the issue is Paul's suggestion, as follows:
> 
> > So I do not believe that avc_insert() needs rcu_read_lock().
> > Unless I am missing something, the rcu_read_lock() acquired
> > in avc_has_perm_noaudit() should be moved after the call to
> > avc_insert().
> 
> I didn't move the rcu_read_lock() because of the possibility of preemption
> between the spin_unlock_irqrestore() in avc_insert() and a rcu_read_lock()
> placed after avc_insert() in avc_has_perm_noaudit().
> 
> When returning from avc_insert(), we can't ignore the possibility that
> execution is preempted at exactly that point.
> Therefore, I didn't move rcu_read_lock(), in spite of the redundancy.
> 
> If rcu_read_lock() were moved after avc_insert():
> [ in avc_insert() ]----------------------------
>                 :
>         spin_lock_irqsave(&avc_cache.slots_lock[hvalue], flag);
>         list_for_each_entry(pos, &avc_cache.slots[hvalue], list) {
>                 :
>         }
>         list_add_rcu(&node->list, &avc_cache.slots[hvalue]);
> found:
>         spin_unlock_irqrestore(&avc_cache.slots_lock[hvalue], flag);  ---------
>         //  +--> including preempt_enable()                               |
>                 It has the danger of releasing the 'node'.                V
>     }                                                                preemption
> out:                                                                     is
>     return node;                                                       possible
> }
> -----------------------------------------------
> Because it's legal to hold the rcu_read_lock() twice, as Paul says,
> we should do so for safety.
> That is why I didn't move rcu_read_lock() at this point;
> my explanation may have been lacking, sorry.

Works for me!  Might be worth adding a comment, though.
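
Such a comment might read, for example (hypothetical wording, not part
of the posted patch):

[ in avc_has_perm_noaudit() ]------------------
	/*
	 * Keep rcu_read_lock() held across avc_insert(): the insertion
	 * itself is protected by the per-slot spinlock, but the node it
	 * returns is only guaranteed to stay allocated while a read-side
	 * critical section is open.  Taking rcu_read_lock() only after
	 * avc_insert() returns would leave a preemption window in which
	 * a concurrent update's call_rcu() could free the node.
	 */
	rcu_read_lock();
	node = avc_insert(ssid, tsid, tclass, &entry);
-----------------------------------------------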

							Thanx, Paul

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2004-08-31 16:25 UTC | newest]

Thread overview: 31+ messages
2004-08-16  9:33 RCU issue with SELinux (Re: SELINUX performance issues) Kaigai Kohei
2004-08-16 15:19 ` James Morris
2004-08-20 13:36   ` Kaigai Kohei
2004-08-20 14:53     ` James Morris
2004-08-24  7:27       ` Kaigai Kohei
2004-08-24 13:24         ` James Morris
2004-08-25  9:51           ` Kaigai Kohei
2004-08-25 18:31             ` James Morris
2004-08-25  9:52           ` [PATCH]atomic_inc_return() for i386/x86_64 (Re: RCU issue with SELinux) Kaigai Kohei
2004-08-20 17:31     ` RCU issue with SELinux (Re: SELINUX performance issues) Luke Kenneth Casson Leighton
2004-08-20 18:15       ` James Morris
2004-08-20 20:19     ` Paul E. McKenney
2004-08-20 20:35       ` James Morris
2004-08-24  7:27       ` Kaigai Kohei
     [not found]     ` <1093014789.16585.186.camel@moss-spartans.epoch.ncsc.mil>
2004-08-24  7:25       ` Kaigai Kohei
2004-08-24 15:37         ` Stephen Smalley
2004-08-25  9:51           ` Kaigai Kohei
2004-08-25 15:50             ` Stephen Smalley
2004-08-25 16:11               ` Stephen Smalley
2004-08-26  7:53               ` Kaigai Kohei
2004-08-26 13:24                 ` Stephen Smalley
2004-08-27 11:07                   ` Kaigai Kohei
2004-08-30 11:17                   ` [PATCH]SELinux performance improvement by RCU (Re: RCU issue with SELinux) Kaigai Kohei
2004-08-30 15:35                     ` Stephen Smalley
2004-08-30 16:13                       ` Paul E. McKenney
2004-08-31  4:33                         ` Kaigai Kohei
2004-08-31 16:20                           ` Paul E. McKenney
2004-08-31 15:33                     ` James Morris
2004-08-24 23:02         ` RCU issue with SELinux (Re: SELINUX performance issues) Paul E. McKenney
2004-08-25  9:51           ` Kaigai Kohei
2004-08-25 17:34             ` Paul E. McKenney
