From: Honnappa Nagarahalli
Subject: Re: [PATCH 2/4] hash: add memory ordering to avoid race conditions
Date: Sun, 30 Sep 2018 22:20:59 +0000
To: "Wang, Yipeng1", "Richardson, Bruce", "De Lara Guarch, Pablo"
Cc: "dev@dpdk.org", "Gavin Hu (Arm Technology China)", Steve Capper, Ola Liljedahl, nd

>
> Some general comments for the various __atomic_store/load added,
>
> 1. Although it passes the compiler check, I just want to confirm whether we
> should use the GCC/clang builtins, or whether there are higher-level APIs in
> DPDK to do atomic operations.
>
I have used the gcc builtins (just like rte_ring does).

> 2. We believe the compiler will translate the atomic_store/load to regular MOV
> instructions on a Total Store Order architecture (e.g. x86_64). But we ran the
> perf test on x86 and here is the relative slowdown on lookup compared to
> master head. I am not sure if the performance drop comes from the atomic
> builtins.
>
C11 atomics also block compiler reordering. Other than this, the retry loop is
an addition to lookup.
The patch also has the alignment corrected. I am not sure how that is affecting
the perf numbers.

> Keysize | single lookup | bulk lookup
> 4       | 0.93          | 0.95
> 8       | 0.95          | 0.96
> 16      | 0.97          | 0.96
> 32      | 0.97          | 1.00
> 48      | 1.03          | 0.99
> 64      | 1.04          | 0.98
> 9       | 0.91          | 0.96
> 13      | 0.97          | 0.98
> 37      | 1.04          | 1.03
> 40      | 1.02          | 0.98
>
I assume this is the data from the test cases in the test_hash_perf.c file. I
tried to reproduce this data, but my data is worse. Can you specify the actual
test from test_hash_perf.c you are using (with locks/pre-computed hash/with
data/elements in primary)?
IMO, the differences you have provided are not high.

> Other comments inline.
>
> >-----Original Message-----
> >The only race condition that can occur is using the key store element
> >before the key write is completed. Hence, the release memory order is
> >used while inserting the element. Any other race condition is caught
> >by the key comparison. Memory orderings are added only where needed.
> >For example, reads in the writer's context do not need memory ordering
> >as there is a single writer.
> >
> [Wang, Yipeng] I assume this commit works together with the next one to
> enable read-write concurrency on relaxed memory ordering machines. This
> commit by itself does not do that, right?
> If this is the case, should we merge the two commits, or maybe just explain it
> a little bit more in the commit message?
This commit works only with the next commit. I kept it separate to make it easy
to review. I will merge it in the final patch.

>
> >key_idx in the bucket entry and pdata in the key store element are used
> >for synchronisation. key_idx is used to release an inserted entry in
> >the bucket to the reader. Use of pdata for synchronisation is required
> >due to the update of an existing entry, wherein only the pdata is
> >updated without updating key_idx.
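To make the scheme described above concrete, the writer/reader pairing is
roughly the following. This is a minimal standalone sketch using the gcc
builtins; the fixed-size key, the struct layout and the function names are
simplified placeholders, not the actual rte_cuckoo_hash.c code:

#include <stdint.h>
#include <string.h>

#define EMPTY_SLOT 0

struct key_entry {
	char key[16];	/* simplified fixed-size key */
	void *pdata;
};

/* Writer: complete the key store entry first, then publish key_idx with
 * a release store. A reader that observes the new key_idx is guaranteed
 * to also observe the completed key write.
 */
static void
publish_entry(uint32_t *bkt_key_idx, struct key_entry *k,
		const void *key, void *data, uint32_t new_idx)
{
	memcpy(k->key, key, sizeof(k->key));
	__atomic_store_n(&k->pdata, data, __ATOMIC_RELEASE);
	__atomic_store_n(bkt_key_idx, new_idx, __ATOMIC_RELEASE);
}

/* Reader: the acquire load of key_idx pairs with the writer's release
 * store, so the key comparison that follows sees the fully written key.
 */
static void *
lookup_entry(const uint32_t *bkt_key_idx, const struct key_entry *keys,
		const void *key)
{
	uint32_t idx = __atomic_load_n(bkt_key_idx, __ATOMIC_ACQUIRE);

	if (idx == EMPTY_SLOT)
		return NULL;
	if (memcmp(keys[idx].key, key, sizeof(keys[idx].key)) != 0)
		return NULL;
	return __atomic_load_n(&keys[idx].pdata, __ATOMIC_ACQUIRE);
}

For an update of an existing entry, only the release store to pdata runs and
key_idx is untouched, which is why pdata has to carry the synchronisation in
that case.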
> >
> >-/* Search a key from bucket and update its data */
> >+/* Search a key from bucket and update its data.
> >+ * Writer holds the lock before calling this.
> >+ */
> > static inline int32_t
> > search_and_update(const struct rte_hash *h, void *data, const void *key,
> > 		struct rte_hash_bucket *bkt, hash_sig_t sig, hash_sig_t alt_hash)
> >@@ -499,8 +501,13 @@ search_and_update(const struct rte_hash *h, void *data, const void *key,
> > 		k = (struct rte_hash_key *) ((char *)keys +
> > 				bkt->key_idx[i] * h->key_entry_size);
> > 		if (rte_hash_cmp_eq(key, k->key, h) == 0) {
> >-			/* Update data */
> >-			k->pdata = data;
> >+			/* 'pdata' acts as the synchronization point
> >+			 * when an existing hash entry is updated.
> >+			 * Key is not updated in this case.
> >+			 */
> [Wang, Yipeng] I did not quite understand why we need synchronization
> for the hash data update.
> Since the pdata write is already atomic, the lookup will either read out the
> stale data or the new data, which should be fine without synchronization.
> Is it to ensure the order of multiple reads in lookup threads?
Yes, it is to ensure the order of reads in the lookup threads. There might be
reads in the application and we want to make sure they do not get reordered
with this.

>
> >@@ -1243,11 +1289,19 @@ __rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
> > 		while (sec_hitmask[i]) {
> > 			uint32_t hit_index = __builtin_ctzl(sec_hitmask[i]);
> >
> >-			uint32_t key_idx = secondary_bkt[i]->key_idx[hit_index];
> >+			uint32_t key_idx =
> >+				__atomic_load_n(
> >+					&secondary_bkt[i]->key_idx[hit_index],
> >+					__ATOMIC_ACQUIRE);
> > 			const struct rte_hash_key *key_slot =
> > 				(const struct rte_hash_key *)(
> > 				(const char *)h->key_store +
> > 				key_idx * h->key_entry_size);
> >+
> >+			if (key_idx != EMPTY_SLOT)
> [Wang, Yipeng] I think even for the current code, we need to check EMPTY_SLOT.
> Could you export this as a bug fix commit?
>
In the existing code, there is a check 'if (!!key_idx & !rte_hash....)'. Are
you referring to '!!key_idx'? I think this should be changed to
'(key_idx != EMPTY_SLOT)'.
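To be explicit, the shape I would propose for the bulk-lookup check is below.
This is a sketch against the hunk quoted above, not a complete patch; the
positions[] bookkeeping and the hits bitmap handling are elided:

			uint32_t key_idx =
				__atomic_load_n(
					&secondary_bkt[i]->key_idx[hit_index],
					__ATOMIC_ACQUIRE);

			/* Skip the key store access entirely for an empty
			 * slot, instead of relying on '!!key_idx' to mask a
			 * comparison that has already dereferenced the slot.
			 */
			if (key_idx != EMPTY_SLOT) {
				const struct rte_hash_key *key_slot =
					(const struct rte_hash_key *)(
					(const char *)h->key_store +
					key_idx * h->key_entry_size);
				if (rte_hash_cmp_eq(keys[i], key_slot->key,
							h) == 0) {
					/* record the hit for keys[i] */
				}
			}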