Date: Wed, 14 May 2008 22:57:06 +0100
From: Jamie Lokier
To: Evgeniy Polyakov
Cc: Jeff Garzik, Sage Weil, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: POHMELFS high performance network filesystem. Transactions, failover, performance.
Message-ID: <20080514215704.GE23758@shareable.org>
In-Reply-To: <20080514193843.GB10165@2ka.mipt.ru>
X-Mailing-List: linux-kernel@vger.kernel.org

Evgeniy Polyakov wrote:
> > Quite true, but IMO single-node performance is largely an academic
> > exercise today.  What production system is run without backups or
> > replication?
>
> If cluster is made out of 2-3-4-10 machines, it does want to get maximum
> single node performance. But I agree that in some cases we have to
> sacrifice of something in order to find something new. And the larger
> cluster becomes, for more things we can close eyes on.
With the right topology and hardware, you can get _faster_ than
single-node performance with as many nodes as you like, except when
there is a node/link failure and the network pauses briefly to
reorganise - and even that is solvable.

Consider:

    Client <-> A <-> B <-> C <-> D

A to D are servers.  <-> are independent network links.

Each server has hardware which can forward a packet at the same time
it's being received, like the best switches (wormhole routing), while
performing minor transformations on it (I did say the right hardware ;-)

Client sends a request message.  It is forwarded along the whole
chain, and reaches D with just a few microseconds of delay compared
with A.  All servers process the message and produce a response in
about the same time.  However (think of RAID), they don't all process
all the data in the message, just the part they are responsible for,
so each might do its part faster than a single node processing the
whole message.  The aggregate response is a function of all of them.

D sends its response.  C forwards that packet while modifying the
answer to include its own response.  B and A do the same.  The answer
arrives at the Client just a few microseconds later than it would have
with a single server.  If desired, arrange the nodes in a tree to
reduce even those microseconds.

Such network hardware is quite feasible - indeed quite easy with an
FPGA-based NIC.

Enjoy the speed :-)

-- Jamie
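P.S.  The per-hop aggregation can be sketched in software as a toy
model (this is only an illustration of the data flow, not the hardware
path; the Server class, the key/shard layout and the recursive
forwarding here are all invented for the example):

```python
class Server:
    """One node in the chain; owns a RAID-like shard of the data."""
    def __init__(self, name, shard):
        self.name, self.shard = name, shard

    def process(self, request):
        # Each node answers only for the keys it is responsible for.
        return {k: self.shard[k] for k in request if k in self.shard}

def chain_request(servers, request):
    """Forward the request down the chain; merge responses on the way back.

    In the hardware version the forwarding and the merge happen
    in-flight, wormhole style; here the recursion just models the
    same per-hop modification of the returning packet.
    """
    if not servers:
        return {}
    head, rest = servers[0], servers[1:]
    downstream = chain_request(rest, request)  # forwarded toward D
    downstream.update(head.process(request))   # C/B/A patch the answer in
    return downstream

servers = [Server(n, s) for n, s in [
    ("A", {"x": 1}), ("B", {"y": 2}), ("C", {"z": 3}), ("D", {"w": 4})]]
print(chain_request(servers, ["x", "y", "z", "w"]))
# → {'w': 4, 'z': 3, 'y': 2, 'x': 1}
```

Each node touches only its own shard, which is why the chain can beat
a single server handling the whole request.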