From: Eric Dumazet <eric.dumazet@gmail.com>
To: Yakov Lerner <iler.ml@gmail.com>
Cc: netdev@vger.kernel.org, davem@davemloft.net
Subject: Re: [PATCH] /proc/net/tcp, overhead removed
Date: Tue, 29 Sep 2009 06:39:17 +0200
Message-ID: <4AC18F75.3090402@gmail.com>
In-Reply-To: <1254178906-5293-1-git-send-email-iler.ml@gmail.com>

Yakov Lerner wrote:
> Take 2. 
> 
> "Sharp improvement in performance of /proc/net/tcp when number of 
> sockets is large and hashsize is large. 
> O(numsock * hashsize) time becomes O(numsock + hashsize). On slow
> processors, speed difference can be x100 and more."
> 
> I must say that I'm not fully satisfied with my choice of "st->sbucket"
> for the new preserved index. A better name would be "st->snum".
> Re-using "st->sbucket" saves 4 bytes and keeps the patch to one source file.
> But "st->sbucket" has a different meaning in the OPENREQ and LISTEN states;
> this can be confusing.
> Maybe it is better to add an "snum" member to struct tcp_iter_state?

You can add more fields to tcp_iter_state if it makes the code easier to read
and faster.

This structure is allocated once, at open("/proc/net/tcp") time, and can
be any reasonable size. You can add 10 longs to it; it is not a big deal.
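For instance (a sketch only, from memory of the struct in include/net/tcp.h,
so the exact field layout is approximate), with the snum member you proposed:

/* Sketch, not a patch: tcp_iter_state with a dedicated snum member,
 * instead of overloading st->sbucket.  snum would hold the seq index
 * (*pos) of the first entry of ehash slot st->bucket.
 */
struct tcp_iter_state {
	struct seq_net_private	p;
	sa_family_t		family;
	enum tcp_seq_states	state;
	struct sock		*syn_wait_sk;
	int			bucket, sbucket, num, uid;
	int			snum;	/* proposed: *pos of the first
					 * entry in slot st->bucket */
};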

> 
> Shall I change the subject when sending "take N+1", or keep the old subject?

Not a big deal, but keeping the old subject is probably the common way.

[PATCH v2] tcp: Remove /proc/net/tcp O(N*H) overhead

> 
> Signed-off-by: Yakov Lerner <iler.ml@gmail.com>
> ---
>  net/ipv4/tcp_ipv4.c |   35 +++++++++++++++++++++++++++++++++--
>  1 files changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 7cda24b..e4c4f19 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1994,13 +1994,14 @@ static inline int empty_bucket(struct tcp_iter_state *st)
>  		hlist_nulls_empty(&tcp_hashinfo.ehash[st->bucket].twchain);
>  }
>  
> -static void *established_get_first(struct seq_file *seq)
> +static void *established_get_first_after(struct seq_file *seq, int bucket)
>  {
>  	struct tcp_iter_state *st = seq->private;
>  	struct net *net = seq_file_net(seq);
>  	void *rc = NULL;
>  
> -	for (st->bucket = 0; st->bucket < tcp_hashinfo.ehash_size; ++st->bucket) {
> +	for (st->bucket = bucket; st->bucket < tcp_hashinfo.ehash_size;
> +	     ++st->bucket) {
>  		struct sock *sk;
>  		struct hlist_nulls_node *node;
>  		struct inet_timewait_sock *tw;
> @@ -2010,6 +2011,8 @@ static void *established_get_first(struct seq_file *seq)
>  		if (empty_bucket(st))
>  			continue;
>  

> +		st->sbucket = st->num;
> +

oh this is ugly...

Check tcp_seq_stop() to see why st->sbucket should not change after taking
the lock. Any reader of this will have a heart attack :)

>  		spin_lock_bh(lock);
>  		sk_nulls_for_each(sk, node, &tcp_hashinfo.ehash[st->bucket].chain) {
>  			if (sk->sk_family != st->family ||
> @@ -2036,6 +2039,11 @@ out:
>  	return rc;
>  }
>  
> +static void *established_get_first(struct seq_file *seq)
> +{
> +	return established_get_first_after(seq, 0);
> +}
> +
>  static void *established_get_next(struct seq_file *seq, void *cur)
>  {
>  	struct sock *sk = cur;
> @@ -2064,6 +2072,9 @@ get_tw:
>  		while (++st->bucket < tcp_hashinfo.ehash_size &&
>  				empty_bucket(st))
>  			;
> +
> +		st->sbucket = st->num;

Same here; this is ugly, even if it happens to work.

> +
>  		if (st->bucket >= tcp_hashinfo.ehash_size)
>  			return NULL;
>  
> @@ -2107,6 +2118,7 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
>  
>  	if (!rc) {
>  		st->state = TCP_SEQ_STATE_ESTABLISHED;
> +		st->sbucket = 0;
>  		rc	  = established_get_idx(seq, pos);
>  	}
>  
> @@ -2116,6 +2128,25 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
>  static void *tcp_seq_start(struct seq_file *seq, loff_t *pos)
>  {
>  	struct tcp_iter_state *st = seq->private;
> +
> +	if (*pos && *pos >= st->sbucket &&
> +	    (st->state == TCP_SEQ_STATE_ESTABLISHED ||
> +	     st->state == TCP_SEQ_STATE_TIME_WAIT)) {
> +		void *cur;
> +		int nskip;
> +
> +		/* for states estab and tw, st->sbucket is index (*pos) */
> +		/* corresponding to the beginning of bucket st->bucket */
> +
> +		st->num = st->sbucket;
ugly...
> +		/* jump to st->bucket, then skip (*pos - st->sbucket) items */
> +		st->state = TCP_SEQ_STATE_ESTABLISHED;
> +		cur = established_get_first_after(seq, st->bucket);
> +		for (nskip = *pos - st->num; cur && nskip > 0; --nskip)
> +			cur = established_get_next(seq, cur);
> +		return cur;
> +	}
> +

I don't think you need this chunk in tcp_seq_start(), and it is also probably
buggy. That claim is hard to prove, though; we would need some program that
creates TIME_WAIT sockets in a reproducible way.

Jumping to the right hash slot is more than enough to avoid the O(N*H) problem.
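Something like this in tcp_seq_start() would express it (a rough sketch only,
untested; it mirrors your chunk above, but uses the proposed st->snum member
and keeps the existing slow path for every other case, including TIME_WAIT):

static void *tcp_seq_start(struct seq_file *seq, loff_t *pos)
{
	struct tcp_iter_state *st = seq->private;
	void *cur;
	loff_t nskip;

	if (*pos && st->state == TCP_SEQ_STATE_ESTABLISHED &&
	    *pos >= st->snum) {
		/* st->snum is the seq index of the first entry of ehash
		 * slot st->bucket: jump there and skip forward, instead
		 * of rescanning all the slots before it.
		 */
		st->num = st->snum;
		cur = established_get_first_after(seq, st->bucket);
		for (nskip = *pos - st->num; cur && nskip > 0; --nskip)
			cur = established_get_next(seq, cur);
		return cur;
	}
	/* first read, LISTENING/OPENREQ or TIME_WAIT: existing slow path */
	return *pos ? tcp_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
}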

You should try to optimize both the established and listening algorithms, so
that the code is readable and maintainable. In pathological cases, we can
also have 10000 sockets in LISTENING/OPENREQ state.

Maybe we need a first patch to clean up the code, since it is really complex,
and then a patch to optimize it?

IMHO, /proc/net/tcp suffers from correctness bugs before it suffers from
a performance problem.

Currently, we can fail to output some live sockets in the dump, if:

Thread A gets a block from /proc/net/tcp and stops in hash slot N, at socket X.
Thread B deletes socket X (which sits before socket Y in the hash chain), or
any socket in a previous hash slot.
Thread A gets the next block, missing socket Y and possibly Y+1, Y+2, ...

-> Thread A doesn't see socket Y as an established/timewait socket.

So I believe being able to store the hash slot could really help performance
and also avoid skipping lots of sockets when a thread B destroys sockets
'before our cursor'.

The remaining window would be small, as only deleting sockets in our current
hash slot could make us skip live sockets. (And closing this hole is really
tricky; inet_diag has the same problem, I believe.)

The following program establishes 10000 sockets in listening state and
2*10000 in established state. It uses non-random ports so that we can
compare before/after patches.

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>

int fdlisten[10000];
#define PORT 2222
int main(int argc, char *argv[])
{
        int i;
        struct sockaddr_in sockaddr, locaddr;

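        /* create 10000 listeners, one per 127.0.0.x address, all on port PORT */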
        for (i = 0; i < 10000; i++) {
                fdlisten[i] = socket(AF_INET, SOCK_STREAM, 0);
                memset(&sockaddr, 0, sizeof(sockaddr));
                sockaddr.sin_family = AF_INET;
                sockaddr.sin_port = htons(PORT);
                sockaddr.sin_addr.s_addr = htonl(0x7f000001 + i);
                if (bind(fdlisten[i], (struct sockaddr *)&sockaddr, sizeof(sockaddr)) == -1) {
                        perror("bind");
                        return 1;
                }
                if (listen(fdlisten[i], 1) == -1) {
                        perror("listen");
                        return 1;
                }
        }
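        /* child: accept connections forever, cycling over all the listeners */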
        if (fork() == 0) {
                i = 0;
                while (1) {
                        socklen_t len = sizeof(sockaddr);
                        int newfd = accept(fdlisten[i++], (struct sockaddr *)&sockaddr, &len);

                        if (newfd == -1)
                                perror("accept");
                        if (i == 10000)
                                i = 0;
                }
        }
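        /*
         * parent: close its copies of the listeners (the child still holds
         * them open) and create 10000 connections, one per 127.0.0.x address.
         */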
        for (i = 0 ; i < 10000; i++) {
                int fd;

                close(fdlisten[i]);
                fd = socket(AF_INET, SOCK_STREAM, 0);
                if (fd == -1) {
                        perror("socket");
                        break;
                }
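                /* fixed local port (20000 + i): non-random, so runs are comparable */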
                memset(&locaddr, 0, sizeof(locaddr));
                locaddr.sin_family = AF_INET;
                locaddr.sin_port = htons(i + 20000);
                locaddr.sin_addr.s_addr = htonl(0x7f000001 + i);
                bind(fd, (struct sockaddr *)&locaddr, sizeof(locaddr));

                memset(&sockaddr, 0, sizeof(sockaddr));
                sockaddr.sin_family = AF_INET;
                sockaddr.sin_port = htons(PORT);
                sockaddr.sin_addr.s_addr = htonl(0x7f000001 + i);
                connect(fd, (struct sockaddr *)&sockaddr, sizeof(sockaddr));
        }
        pause();
        return 0;
}
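With this program running, timing a full read before and after the patch
(for example "time cat /proc/net/tcp > /dev/null") should show the
difference.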
