From: Yakov Lerner <iler.ml@gmail.com>
To: netdev@vger.kernel.org, Eric Dumazet <eric.dumazet@gmail.com>,
	David Miller <davem@davemloft.net>
Subject: Re: [PATCH] /proc/net/tcp, overhead removed
Date: Tue, 29 Sep 2009 01:10:18 +0300
Message-ID: <f36b08ee0909281510y282d621etb4264ecd92cbe8f0@mail.gmail.com>
In-Reply-To: <4ABF360E.7080301@gmail.com>

On Sun, Sep 27, 2009 at 12:53, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Yakov Lerner wrote:
>> /proc/net/tcp does 20,000 sockets in 60-80 milliseconds, with this patch.
>>
>> The overhead was in tcp_seq_start(). See analysis (3) below.
>> The patch is against Linus git tree (1). The patch is small.
>>
>> ------------  -----------   ------------------------------------
>> Before patch  After patch   20,000 sockets (10,000 tw + 10,000 estab)(2)
>> ------------  -----------   ------------------------------------
>> 6 sec          0.06 sec     dd bs=1k if=/proc/net/tcp >/dev/null
>> 1.5 sec        0.06 sec     dd bs=4k if=/proc/net/tcp >/dev/null
>>
>> 1.9 sec        0.16 sec     netstat -4ant >/dev/null
>> ------------  -----------   ------------------------------------
>>
>> This is a ~25x improvement.
>> The new time does not depend on the read blocksize.
>> The speed of netstat naturally improves too, for both -4 and -6.
>> /proc/net/tcp6 does 20,000 sockets in 100 milliseconds.
>>
>> (1) against git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
>>
>> (2) Used 'manysock' utility to stress system with large number of sockets:
>>   "manysock 10000 10000"    - 10,000 tw + 10,000 estab ip4 sockets.
>>   "manysock -6 10000 10000" - 10,000 tw + 10,000 estab ip6 sockets.
>> Found at http://ilerner.3b1.org/manysock/manysock.c
>>
>> (3) Algorithmic analysis.
>>     Old algorithm.
>>
>> During 'cat </proc/net/tcp', tcp_seq_start() is called O(numsockets) times (4).
>> On average, every call to tcp_seq_start() scans half the whole hashtable. Ouch.
>> This is O(numsockets * hashsize). 95-99% of 'cat </proc/net/tcp' is spent in
>> tcp_seq_start()->tcp_get_idx. This overhead is eliminated by the new algorithm,
>> which is O(numsockets + hashsize).
>>
>>     New algorithm.
>>
>> The new algorithm is O(numsockets + hashsize). We jump to the right
>> hash bucket in tcp_seq_start(), without scanning half the hash.
>> To jump right to the hash bucket corresponding to *pos in tcp_seq_start(),
>> we reuse three pieces of state (st->num, st->bucket, st->sbucket)
>> as follows:
>>  - we check that the requested pos >= the last seen pos (st->num), the typical case;
>>  - if so, we jump to bucket st->bucket;
>>  - to arrive at the right item after the beginning of st->bucket, we keep
>> in st->sbucket the position corresponding to the beginning of that bucket.
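
To make the resume step concrete, here is a small stand-alone user-space model
of the idea (this is not the kernel code; HASH_SIZE, struct iter_state and
get_idx_resume are made-up names): the iterator remembers the last position,
its bucket, and the global position of that bucket's first entry, and on the
next call it jumps straight to that bucket instead of rescanning from bucket 0.

#include <stdio.h>
#include <stdlib.h>

#define HASH_SIZE 8

struct node { int val; struct node *next; };

struct iter_state {
	int num;	/* last position handed out (st->num analogue)         */
	int bucket;	/* bucket that position lives in (st->bucket)          */
	int sbucket;	/* global position of that bucket's first item (st->sbucket) */
};

static struct node *table[HASH_SIZE];

/* Return the item at global position pos, resuming from the saved state
 * instead of rescanning every bucket from 0 (the old behaviour). */
static struct node *get_idx_resume(struct iter_state *st, int pos)
{
	int b, base;

	if (pos >= st->num) {		/* typical case: we only move forward */
		b = st->bucket;
		base = st->sbucket;
	} else {			/* fall back to a full scan */
		b = 0;
		base = 0;
	}

	for (; b < HASH_SIZE; b++) {
		struct node *n = table[b];
		int i = base;

		for (; n; n = n->next, i++) {
			if (i == pos) {
				st->num = pos;
				st->bucket = b;
				st->sbucket = base;
				return n;
			}
		}
		base = i;	/* the next bucket's first item starts here */
	}
	return NULL;
}

int main(void)
{
	struct iter_state st = { 0, 0, 0 };
	int i;

	for (i = 0; i < 20; i++) {	/* put 20 items into 8 buckets */
		struct node *n = malloc(sizeof(*n));
		n->val = i;
		n->next = table[i % HASH_SIZE];
		table[i % HASH_SIZE] = n;
	}

	for (i = 0; ; i++) {		/* sequential walk, as seq_file restarts would do */
		struct node *n = get_idx_resume(&st, i);
		if (!n)
			break;
		printf("pos %2d -> val %d\n", i, n->val);
	}
	return 0;
}
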
>>
>> (4) Explanation of the O(numsockets * hashsize) cost of the old algorithm.
>>
>> tcp_seq_start() is called once for every ~7 lines of netstat output
>> if the readsize is 1kb, or once for every ~28 lines if the readsize is >= 4kb.
>> Since each /proc/net/tcp record is 150 bytes, the formula for the
>> number of calls to tcp_seq_start() is
>>             (numsockets * 150 / min(4096, readsize)).
>> Netstat uses a 4kb readsize (newer versions) or 1kb (older versions).
>> Note that the speed of the old algorithm does not improve above a 4kb blocksize.
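
(Plugging in the numbers from the test above: 20,000 sockets * 150 bytes is about
3,000,000 bytes of text, so a 4 kB reader makes roughly 3,000,000 / 4096 ≈ 732
calls to tcp_seq_start(), and a 1 kB reader about 2,929; with the old algorithm
each of those calls scans, on average, half of the hash table.)
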
>>
>> The speed of the new algorithm does not depend on blocksize.
>>
>> The speed of the new algorithm does not perceptibly depend on hashsize (which
>> depends on ramsize). The speed of the old algorithm drops with a bigger hashsize.
>>
>> (5) Reporting order.
>>
>> The reporting order is exactly the same as before if the hash does not change
>> underfoot. When hash elements come and go during the report, the reporting
>> order will be the same as that of tcpdiag.
>>
>> Signed-off-by: Yakov Lerner <iler.ml@gmail.com>
>> ---
>>  net/ipv4/tcp_ipv4.c |   26 ++++++++++++++++++++++++--
>>  1 files changed, 24 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
>> index 7cda24b..7d9421a 100644
>> --- a/net/ipv4/tcp_ipv4.c
>> +++ b/net/ipv4/tcp_ipv4.c
>> @@ -1994,13 +1994,14 @@ static inline int empty_bucket(struct tcp_iter_state *st)
>>               hlist_nulls_empty(&tcp_hashinfo.ehash[st->bucket].twchain);
>>  }
>>
>> -static void *established_get_first(struct seq_file *seq)
>> +static void *established_get_first_after(struct seq_file *seq, int bucket)
>>  {
>>       struct tcp_iter_state *st = seq->private;
>>       struct net *net = seq_file_net(seq);
>>       void *rc = NULL;
>>
>> -     for (st->bucket = 0; st->bucket < tcp_hashinfo.ehash_size; ++st->bucket) {
>> +     for (st->bucket = bucket; st->bucket < tcp_hashinfo.ehash_size;
>> +          ++st->bucket) {
>>               struct sock *sk;
>>               struct hlist_nulls_node *node;
>>               struct inet_timewait_sock *tw;
>> @@ -2036,6 +2037,11 @@ out:
>>       return rc;
>>  }
>>
>> +static void *established_get_first(struct seq_file *seq)
>> +{
>> +     return established_get_first_after(seq, 0);
>> +}
>> +
>>  static void *established_get_next(struct seq_file *seq, void *cur)
>>  {
>>       struct sock *sk = cur;
>> @@ -2045,6 +2051,7 @@ static void *established_get_next(struct seq_file *seq, void *cur)
>>       struct net *net = seq_file_net(seq);
>>
>>       ++st->num;
>> +     st->sbucket = st->num;
>
> Hello Yakov
>
> The intention of your patch is very good, but it is not working correctly yet.
>
> It seems you assume there is at most one entry per hash slot, or something like that.
>
> Please reboot your test machine with "thash_entries=4096" so that tcp hash
> size is 4096, and try to fill 20000 tcp sockets with a test program.
>
> then :
>
> # ss | wc -l
> 20001
> (ok)
>
> # cat /proc/net/tcp | wc -l
> 22160
> (not quite correct ...)
>
> # netstat -tn | wc -l
> <never ends>
>
>
> # dd if=/proc/net/tcp ibs=1024 | wc -l
> <never ends>
>
>
> Please send your next patch to netdev@vger.kernel.org and DaveM only; that is where
> netdev people review netdev patches. There is no need to include other people for first submissions.
>
> Thank you
>
>
> #include <sys/types.h>
> #include <sys/socket.h>
> #include <netinet/in.h>
> #include <string.h>
> #include <stdio.h>
> #include <unistd.h>
>
> int fdlisten;
>
> int main(void)
> {
>        int i;
>        struct sockaddr_in sockaddr;
>
>        /* Listening socket on 0.0.0.0:2222; the child below accepts in a loop. */
>        fdlisten = socket(AF_INET, SOCK_STREAM, 0);
>        memset(&sockaddr, 0, sizeof(sockaddr));
>        sockaddr.sin_family = AF_INET;
>        sockaddr.sin_port = htons(2222);
>        if (bind(fdlisten, (struct sockaddr *)&sockaddr, sizeof(sockaddr)) == -1) {
>                perror("bind");
>                return 1;
>        }
>        if (listen(fdlisten, 10) == -1) {
>                perror("listen");
>                return 1;
>        }
>        if (fork() == 0) {
>                /* Child: accept every incoming connection and keep it open. */
>                while (1) {
>                        socklen_t len = sizeof(sockaddr);
>                        accept(fdlisten, (struct sockaddr *)&sockaddr, &len);
>                }
>        }
>        /* Parent: open 10000 client connections to the listener, which
>         * yields 20000 established sockets (client + server side). */
>        for (i = 0; i < 10000; i++) {
>                int fd = socket(AF_INET, SOCK_STREAM, 0);
>                if (fd == -1) {
>                        perror("socket");
>                        break;
>                }
>                connect(fd, (struct sockaddr *)&sockaddr, sizeof(sockaddr));
>        }
>        pause();        /* keep the sockets alive until the program is killed */
>        return 0;
> }
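
For anyone reproducing this: the filler above builds with a plain
"gcc -o tcpfill tcpfill.c" (file name mine), and it needs "ulimit -n" raised
above 10000 so the client loop can actually open 10,000 descriptors; run it in
the background and then time the /proc/net/tcp readers as shown earlier in the
thread.
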
>

Hello Eric,

I found the problem, thanks: st->sbucket was being set to the current position
in established_get_next() rather than to the position of the bucket's first
entry, so the resume offset is only correct when a bucket holds a single entry.
I'll re-send after testing.

In the meantime, I'd like to ask whether it makes sense to add a /proc/net
entry to switch between the "old way" and the "new way". Such a switch would
allow a quick comparison of the new and old code paths not only by line count
but by full contents, without a reboot.
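
Something along these lines, untested and purely illustrative (the names
tcp_seq_fast, tcp_seq_fast_table, tcp_get_idx_fast and tcp_get_idx_slow are
made up, and registration of the sysctl alongside the existing ipv4 entries
is omitted):

#include <linux/sysctl.h>
#include <linux/seq_file.h>

/* 1 = new bucket-resume iterator, 0 = old full-rescan iterator */
static int tcp_seq_fast = 1;

/* Would be hooked into the existing net/ipv4 sysctl table. */
static struct ctl_table tcp_seq_fast_table[] = {
	{
		.procname	= "tcp_proc_fast_seq",
		.data		= &tcp_seq_fast,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec,
	},
	{ }
};

static void *tcp_get_idx_fast(struct seq_file *seq, loff_t pos);	/* new path */
static void *tcp_get_idx_slow(struct seq_file *seq, loff_t pos);	/* old path */

static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
{
	if (tcp_seq_fast)
		return tcp_get_idx_fast(seq, pos);	/* new algorithm */
	return tcp_get_idx_slow(seq, pos);		/* old algorithm */
}
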

Yakov
