From: Jerome Glisse
To: Dave Hansen
Cc: Logan Gunthorpe, linux-mm@kvack.org, Andrew Morton,
    linux-kernel@vger.kernel.org, "Rafael J. Wysocki", Matthew Wilcox,
    Ross Zwisler, Keith Busch, Dan Williams, Haggai Eran, Balbir Singh,
    "Aneesh Kumar K.V", Benjamin Herrenschmidt, Felix Kuehling,
    Philip Yang, Christian König, Paul Blinzer, John Hubbard,
    Ralph Campbell, Michal Hocko, Jonathan Cameron, Mark Hairgrove,
    Vivek Kini, Mel Gorman, Dave Airlie, Ben Skeggs, Andrea Arcangeli,
    Rik van Riel, Ben Woodard, linux-acpi@vger.kernel.org
Subject: Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
Date: Thu, 6 Dec 2018 17:39:35 -0500
Message-ID: <20181206223935.GG3544@redhat.com>
References: <20181205001544.GR2937@redhat.com>
    <42006749-7912-1e97-8ccd-945e82cebdde@intel.com>
    <20181205021334.GB3045@redhat.com>
    <20181205175357.GG3536@redhat.com>
    <20181206192050.GC3544@redhat.com>

On Thu, Dec 06, 2018 at 02:04:46PM -0800, Dave Hansen wrote:
> On 12/6/18 12:11 PM, Logan Gunthorpe wrote:
> >> My concern is that having folks do per-program parsing, *and* having
> >> a huge amount of data to parse, makes it unusable.  The largest
> >> systems will literally have hundreds of thousands of objects in
> >> /sysfs, even in a single directory.  That makes readdir() basically
> >> impossible, and makes even open() (if you already know the path you
> >> want somehow) hard to do fast.
> >
> > Is this actually realistic?  I find it hard to imagine an actual
> > hardware bus that can have even thousands of devices under a single
> > node, let alone hundreds of thousands.
>
> Jerome's proposal, as I understand it, would have generic "links".
> They're not an instance of a bus, but characterize a class of "link".
> For instance, a "link" might characterize the characteristics of the
> QPI bus between two CPU sockets.  The link directory would enumerate
> the list of all *instances* of that link.
>
> So, a "link" directory for QPI would say Socket0<->Socket1,
> Socket1<->Socket2, Socket2<->PCIe-1.2.3.4, etc.  It would have to
> enumerate the connections between every entity that shared those link
> properties.
>
> While there might not be millions of buses, there could be millions of
> *paths* across all those buses, and that's what the HMAT describes, at
> least: the net result of all those paths.

Sorry if I mis-explained things again.  Links are arrows between nodes
(CPU, device or memory).  An arrow/link has properties associated with
it: bandwidth, latency, cache coherence, ...

So if your system has 4 sockets, each socket is connected to every
other socket (a mesh), and all inter-connects in the mesh have the same
properties, then you have only 1 link directory with the 4 sockets in
it.

Now if the 4 sockets are connected in a ring fashion, i.e.:

    Socket0 - Socket1
       |         |
    Socket3 - Socket2

then you have 4 links:

    link0: socket0 socket1
    link1: socket1 socket2
    link2: socket2 socket3
    link3: socket3 socket0
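
To make the counting concrete, here is a small user-space sketch
(illustrative only, not code from the HMS patches; the function and
constant names are made up, and uniform link properties plus the
"collapse only a full mesh" rule are simplifying assumptions).  It
encodes the two 4-socket examples above and counts the link
directories each one needs:

/*
 * Illustrative only -- not from the HMS patches.  adj[i][j] means
 * sockets i and j share a direct inter-connect, and all inter-connects
 * are assumed to have identical properties (bandwidth, latency, cache
 * coherence).  Simplified rule: if every socket is directly connected
 * to every other one (full mesh), collapse everything into a single
 * link directory; otherwise emit one directory per inter-connect.
 */
#include <stdio.h>
#include <stdbool.h>

#define NSOCKETS 4

static int count_link_dirs(bool adj[NSOCKETS][NSOCKETS])
{
	int i, j, edges = 0;
	bool full_mesh = true;

	for (i = 0; i < NSOCKETS; i++)
		for (j = i + 1; j < NSOCKETS; j++) {
			if (adj[i][j])
				edges++;
			else
				full_mesh = false;
		}
	return full_mesh ? 1 : edges;
}

int main(void)
{
	bool mesh[NSOCKETS][NSOCKETS] = { { false } };
	bool ring[NSOCKETS][NSOCKETS] = { { false } };
	int i, j;

	for (i = 0; i < NSOCKETS; i++) {
		for (j = 0; j < NSOCKETS; j++)
			mesh[i][j] = (i != j);
		ring[i][(i + 1) % NSOCKETS] = true;	/* 0-1, 1-2, 2-3, 3-0 */
		ring[(i + 1) % NSOCKETS][i] = true;
	}

	printf("mesh: %d link directory(ies)\n", count_link_dirs(mesh));
	printf("ring: %d link directory(ies)\n", count_link_dirs(ring));
	return 0;
}

It prints 1 for the mesh and 4 for the ring, matching the two cases
above.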
I do not see how there can be an explosion of link directories; the
worst case is as many link directories as there are buses per
CPU/device/target.  So, worst case, if you have N devices and each
device is connected to 2 buses (PCIe, plus QPI to reach the other
socket, for instance), then you have 2*N link directories (again, this
is a worst case).

A lot of commonality will remain, so I expect that quite a few link
directories will have many symlinks, i.e. you will not get close to the
worst case.

In the end it is easier to think in terms of the physical topology,
where a link corresponds to an inter-connect between two devices or
CPUs.

In all the systems I have seen, even on the craziest roadmaps, I have
only seen something like 128/256 inter-connects (4 sockets, 32/64
devices per socket), and many of those can be grouped under a common
link directory.  Here the worst case is 4 connections per
device/CPU/target, so a worst case of 128/256 * 4 = 512/1024 link
directories, and that is a lot (a back-of-the-envelope version of this
arithmetic is in the P.S. below).  Given the regularity I have seen
described on slides, I expect it would need something like 30 link
directories and 20 bridge directories.

On today's systems (8 GPUs per socket with a GPU link between each GPU,
plus PCIe, across 4 sockets) it comes down to 20 link directories.

In any case, each device/CPU/target has a limit on the number of
buses/inter-connects it is connected to.  I doubt anyone is designing a
device with much more than 4 external bus connections.  So it is not a
link per pair; it is a link per group of devices/CPUs/targets.

Is it any clearer?

Cheers,
Jérôme
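
P.S. A back-of-the-envelope version of the worst-case arithmetic above,
again illustrative only: the socket/device counts are the roadmap
figures I mentioned, and 4 connections per node is my assumed upper
bound.

#include <stdio.h>

int main(void)
{
	const int sockets = 4;
	const int devices_per_socket[] = { 32, 64 };	/* the two roadmap cases */
	const int max_links_per_node = 4;	/* assumed upper bound per device/CPU/target */
	int i;

	for (i = 0; i < 2; i++) {
		int nodes = sockets * devices_per_socket[i];

		/* 128 * 4 = 512, 256 * 4 = 1024 */
		printf("%d nodes x %d connections -> at most %d link directories\n",
		       nodes, max_links_per_node, nodes * max_links_per_node);
	}
	return 0;
}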