netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [EXAMPLE CODE] Parasite thread injection and TCP connection hijacking
@ 2011-08-06 12:12 Tejun Heo
  2011-08-06 12:45 ` Andy Lutomirski
  2011-10-30  4:48 ` hiberante hangs TCP " David Fries
  0 siblings, 2 replies; 12+ messages in thread
From: Tejun Heo @ 2011-08-06 12:12 UTC (permalink / raw)
  To: Matt Helsley, Pavel Emelyanov, Nathan Lynch, Oren Laadan,
	Daniel Lezcano, S
  Cc: James E.J. Bottomley, David S. Miller, linux-kernel, netdev

[-- Attachment #1: Type: text/plain, Size: 23047 bytes --]

Hello, guys.

So, here's transparent TCP connection hijacking (ie. checkpointing in
one process and restoring in another) which adds only relatively small
pieces to the kernel.  It's by no means complete but already works
rather reliably in my test setup even with heavy delay induced with
tc.

I wrote a rather long README describing how it's working, what's
missing which is appended at the end of this mail so if you're
interested in the details please go ahead and read.

Several ioctls are added to enable TCP connection CR, which adds
around 130 lines of code.  Note that the interface is ugly.  As said
above, it's proof-of-concept.  We'll need a bit more information
exported and knobs to turn and hopefully prettier interface.

As my knowledge of networking is fairly rudimentary, I only tried to
get the basics working.  e.g. I didn't try to store negotiated options
and re-establish them on restoration (ie. window scaling, mss, various
extra features), and am likely to have made wrong assumptions even on
the basics.  If you spot some, please shout.

The source is available in the following git branch (just git clone
from the URL)

  https://htejun@code.google.com/p/ptrace-parasite/

and can be browsed at

  http://code.google.com/p/ptrace-parasite/source/browse/

I'm attaching the tcp-ioctls.patch and the source tarball.

Thanks.

--
tejun


Parasite thread injection and TCP connection hijacking example code
===================================================================

This example code is for linux >= 3.1-rc1 on x86_64.

The goal is to demonstrate the followings.

* Transparent injection of a thread into the target process using new
  ptrace commands - PTRACE_SEIZE and INTERRUPT.

* Using the injected thread to capture a TCP connection and restoring
  it in another process.

Both are primarily to serve as the start point for mostly userland
checkpoint-restart implementation.  The latter is likely to be of
interest to virtual server farms and high availability too.

The code contained here is by no means ready for production.  It's
more of proof-of-concept; however, I'll try to document the missing
pieces here and in the comments.


Organization
============

The following files are for the executable 'parasite'.

 main.c		The bulk of the example program which seizes target
		process, inject parasite and sequence it to hijack TCP
		connection.

 parasite.c	Self contained code which is compiled as PIC and
		injected into target process.  Parasite thread
		executes this code.

 parasite.h	Included by both main.c and parasite.c.  Defines
		protocol used between the main program and injected
		parasite thread.

 syscall.h	Syscall interface used by parasite.c.

 setup-nfqueue	Helper script to setup iptables rules to send packets
		for a specified connection to nfqueue.  Don't mix with
		working firewall setup.  Invoked by the executable
		while hijacking TCP connection and assumed to be in
		the same directory.

 flush-nfqueue	Naive script to reverse setup-nfqueue.  It just clears
		INPUT and OUTPUT tables.  Again, don't mix with
		working firewall setup.  Invoked by the executable
		while hijacking TCP connection and assumed to be in
		the same directory.

Build procedure is a bit unusual.  parasite.c is compiled as PIC,
linked into raw binary parasite.bin using parasite.lds and hexdumped
into C array 'char parasite_blob[]' in parasite-blob.h.  main.c
includes this file and the final executable embeds the parasite blob
in it.

The followings are example programs to serve as host to inject
parasite into.

 simple-host.c	Simple pthread program.  Five threads print out
		heartbeat messages each second.  Has simple SIGUSR1/2
		handler for signal testing.

 net-host.c	TCP connection test program.  Depending on parameter,
		it either listens for connection or connects as
		directed.  Once connection is established, both
		parties keep sending incrementing uint64_t and verify
		that received data is incrementing uint64_t's.

		Has bandwidth throttling on the receiver side.  This
		ensures that both local rx and remote tx queues are
		populated.

		SIGUSR1 injects uint64_t which isn't part of the
		sequence which will be detected and reported by the
		remote side.  This can be used for verification and
		measuring send(2) to recv(2) latency.

		SIGUSR2 tests SIOCGOUTSEQS and SIOCPEEKOUTQ.  Just
		added to verify kernel features.

tcp-ioctls.patch is a patch to implement extra TCP ioctls for
connection hijacking.  This will be discussed further later.
Applicable to kernel 3.1-rc1.


Parasite thread injection
=========================

ptrace provides access to full process memory space and register
states, so it could always have manipulated the tracee however it
pleased including making it executing arbitrary code.  Unfortunately,
the previously existing commands depended on signals to interrupt the
tracee and interaction with job control was both poorly designed and
implemented.  This meant that although ptrace could be used to inject
arbitrary code into tracee, it couldn't do that without affecting
signal and job control states.

Broken parts of ptrace have been fixed and three new ptrace requests
are available from kernel 3.1 under development flag (which is
scheduled to be removed from 3.2) - PTRACE_SEIZE, INTERRUPT and
LISTEN.  These new requests allow transparent monitoring and
manipulation of tracee.  Note that transparency is not absolute in the
sense that ptrace operations would behave as signal delivery attempt
which can affect execution of certain system calls; however, userland
is already mandated to handle the condition regardless of ptrace and
albeit not absolute it still is complete in the scope defined by the
API.

Once all threads of the target process are seized, tracer can execute
arbitrary code using the following sequence.

1. PTRACE_SEIZE and INTERRUPT all threads belonging to the target
   process.  This is implemented in main.c::seize_process().  As noted
   in the source, the implementation isn't complete.  Proper
   implementation requires verification and retries in case thread
   creations and/or destructions race against seizing.

   Once INTERRUPT is issued, tracee either stops at job control trap
   or in exec path.  If tracee is about to exec, there isn't much to
   do anyway, so the example code simply aborts in such cases.

   From now on, all operations are assumed to be performed on one of
   the threads (any thread will do).

2. Decide where to inject the foreign code and save the original code
   with PTRACE_PEEKDATA.  Tracer can poke any mapped area regardless
   of protection flags but it can't add execution permission to the
   code, so it needs to choose memory area which already has X flag
   set.  The example code uses the page the %rip is in.

   To allow synchronization, the foreign code raises debug trap (int3)
   after execution finishes.

3. Inject the foreign code using PTRACE_POKEDATA.  The foreign code
   would usually have to be position independent and self-contained.

   Note that this page will be modified with PTRACE_POKEDATA is likely
   to trigger COW.  If a process is to be manipulated multiple times,
   it might be beneficial to use the same page every time.

4. Acquire and save the current register states using PTRACE_GETREGS
   and modify the register states for execution with PTRACE_SETREGS.
   Among others, %rip should point to the start address of the
   injected code and %orig_rax should be set to -1 to avoid
   end-of-syscall processing while returning to userland (register
   state will be restored later and end-of-syscall processing should
   happen after that).

5. Issue PTRACE_CONT to let the tracee return to userland and execute
   the injected code.  Tracer wait(2)s for tracee to enter stop state.
   Only two things can happen - signal delivery or end of execution
   notification via int3.

   Signal delivery either changes job control stop state, kills the
   process or schedules userland signal handler.  Nothing special to
   do about the first two.  For userland signal handler scheduling,
   issuing PTRACE_INTERRUPT before telling tracee to deliver signal
   with PTRACE_CONT makes tracee to re-trap after userland signal
   handler is scheduled without actually executing any userland code.
   Once scheduling is complete, retry from #4.

   After successful execution, tracee would be trapped indicating
   SIGTRAP delivery.  Squash it and put tracee back into job control
   trap by first issuing PTRACE_INTERRUPT followed by PTRACE_CONT.

   This step is implemented in execute_blob().

6. Restore saved registers and memory, and PTRACE_DETACH from all
   threads.

As arbitrary syscall can be issued using injected code, it isn't
difficult to inject larger chunk of code and create a parasite thread
on it.  The example code blocks all signals, uses mmap to allocate
memory, fill it with parasite_blob[], creates the parasite thread
using clone(2) and let it execute the injected code.


EXECUTION EXAMPLE

Running simple-host in a session first and parasite on another yields
the following outputs.  The alphabets in the first column are
referenced below to explain what's going on.

  # ./simple-host
  thread 01(1330): alive
  thread 02(1331): alive
  thread 03(1332): alive
  thread 04(1333): alive
  thread 00(1329): alive
A BLOB: hello, world!
B PARASITE STARTED
C PARASITE SAY: ah ah! mic test!
  thread 01(1330): alive
  thread 02(1331): alive
  thread 03(1332): alive
  thread 00(1329): alive
  ...

  # parasite `pidof simple-host`
  Seizing 1329
  Seizing 1330
  Seizing 1331
  Seizing 1332
  Seizing 1333
A executing test blob
  blocking all signals = 0, prev_sigmask 0
  executing mmap blob = 0x7f16f3024000
  executing clone blob = 1336
B executing parasite
C waiting for connection... connected
  executing munmap blob = 0
  restoring sigmask = 0, prev_sigmask 0xfffffffffffbfeef

On <A>, simple test code blob which says hi to world is injected into
the host and executed by one of the host threads to verify blob
execution works.

A series of blobs are executed afterwards to prepare for thread
injection.  The parasite thread is created in the host and released
for execution on <B>.  The first thing it does is printing out STARTED
message.

After that the injected thread connects back to the main 'parasite'
program using a TCP connection, at which point it's directed to print
out mic test message via SAY command.  This is happening on <C>.

After that, the prep steps are reversed and the target process is
released to continue normal execution.

Job control and USR signals directed at the target process should
behave as expected (sans the extra latency introduced by parasite) no
matter when they are generated.


TCP connection hijacking
========================

This part is much less complete and really a proof-of-concept.  The
goal is to show that TCP connection can be checkpointed in one process
and restored in another with only small additions to the networking
stack.  This also demonstrates that, with parasite threads, most
information is already available to checkpointer and adding mechanisms
to extract and manipulate more states can be done in very
non-obtrusive manner by extending existing API.  There's no new
security, locking or visibility boundary issues.

Note that CR in many cases wouldn't need this transparent snapshotting
of TCP connections.  For example, when CRing whole distributed HPC
workload, there's no reason to maintain TCP details which aren't
visible to applications at all.  Checkpointing threads injected to
both ends can simply drain the connection using recv(2) and restore
them by opening a new connection and repopulating the send buffer on
the other side with send(2).  DMTCP already uses this method to CR TCP
connections.

In general, unless the target connection is going to be terminated
from the target process and restored somewhere else immediately
(connection migration), there is little point in saving and restoring
TCP details including send and receive buffers as they become invalid
as soon as the target process exchanges further packets with the peer
after the checkpointing, and, if the peer is being checkpointed
together, draining and repopulating from each end point as described
above is far better and simpler.

With the above said, the basic states of a TCP connection can be
checkpointed and restarted with the following extra ioctls.  Note that
these ioctls should be considered as proof-of-concept.

 SIOCGINSEQ	Determine TCP sequence to be read on the next
		recv(2) - ie. tp->copied_seq.

 SIOCGOUTSEQS	Determine TCP sequences scheduled for transmission in
		reverse order.  ie. If the seq after SYN was 6, and
		20, 30 and 40 byte packets are in the tx queue without
		receiving any ack, it would return 96, 56, 26 and 6.

 SIOCSOUTSEQ	Set initial TCP sequence to use when establishing a
		connection.  Only valid on a not-yet-connected or
		listening socket.  The next connection established
		will start with the specified sequence.

 SIOCPEEKOUTQ	Peek the content of tx queue.

 SIOCFORCEOUTBD	Force packet separation on the next send(2).
		ie. data from the next send(2) won't be merged into
		the same packet with currently queued data.

A TCP connection can be snapshotted using the following sequence.

s1. Seize target process and inject a parasite thread.

s2. Acquire basic target socket information - IPs and ports.

s3. Block both incoming and outgoing packets belonging to the
    connection.

s4. Acquire rx queue information - the sequence number of the next
    byte to be read and the content of recv buffer.  The former is
    available through SIOCGINSEQ and the latter with recvmsg(2) w/
    MSG_PEEK.

s5. Acquire tx queue information - the sequence numbers of all pending
    packets and the content of send buffer.  The former is available
    through SIOCGOUTSEQS and the latter SIOCPEEKOUTQ.

None of the above steps has irreversible side effect and the
connection can be safely resumed.  To restore the connection, the
following steps can be used.

r1. Packets for the connection are still blocked from s3.  Create a
    way to intercept those packets and inject packets - nf_queue works
    for the former and raw socket for the latter.  It should drop all
    packets other than the ones injected via raw socket.

r2. Create a TCP socket, set outgoing sequence with SIOCSOUTSEQ so
    that it matches the sequence number at the head of the stored send
    queue, and initiate connection.

r3. Upon intercepting SYN, inject SYN/ACK with the sequence number
    matching the head of the stored rx queue.

r4. Upon intercepting ACK reply for SYN/ACK, repopulate the rx queue
    from the stored copy by injecting data packets and waiting for
    ACKs.

r5. Repopulate tx queue with send(2) with interleaving SIOCFORCEOUTBD
    calls to preserve the original packet boundaries.

r6. Connection is ready now.  Let the packets pass through.

The following points are worth mentioning regarding the above
sequences.

* As long as queue information is acquired after packets are blocked,
  there's no danger of data loss due to race condition on both rx and
  tx queues.  If data is received after rx queue is stored, the ack
  wouldn't reach the peer, so it will be retransmitted.  If ack is
  received after tx queue is stored, it just has extra data which will
  be acked and discarded again later.

* Both recv and send buffers need to be blown up before repopulating
  them with stored data.  SO_RCV/SNDBUFFORCE are used for this which
  disabled automatic buffer sizing.  It would be nice if there's a way
  to tell the TCP stack to resume auto resizing afterwards.

* Packet boundaries in tx queue need to be preserved, at least between
  the tx queue head and tp->snd_nxt.  This is because queue
  restoration can result in different packet division and if the peer
  already had received some of the packets before, stream can't be
  resumed with sequences falling inside existing packets.  Note that
  having more divisions is fine as long as the original boundaries are
  still there.

* Another subtlety with tx queue is that the TCP socket needs to
  think that all packets which were transmitted by the original
  connection are already transmitted before the packet barrier comes
  down - ie. its tp->snd_nxt needs to be the same as or after the
  original tp->snd_nxt; otherwise, it might end up ignoring ACKs
  stalling the connection.

  This currently is achieved by advertising maximum window on injected
  response packets so that the TCP socket sends out all queued data
  immediately.  If this isn't a guaranteed behavior, it would make
  sense to provide a way to manipulate tp->snd_nxt.

* The above sequence makes the new socket connect(2) but it would be
  better to reverse the direction to enable restoring server
  connections with N:1 port mapping.

Note that the implemented example code is incomplete in the following
aspects.

* No URG handling.  As OOB data can be acquired inline with other
  data.  Adding a mechanism to export URG offset from the tail of
  queue should be enough.

* Assumes ESTABLISHED.  Proof-of-concept! I get to be lazy! :P

* Doesn't handle options properly during connection negotiation.  I
  was being lazy but also at the same time am not a network expert and
  can't tell which ones should do what.  Needs more trained eyes here.

* Connection faking isn't robust at all.  Again, needs more work and
  some love from network gurus.

* No IPv6.


EXECUTION EXAMPLE

Incomplete as it may be, the example implementation actually works
rather reliably.  The parasite needs to be run as root as it uses
SO_RCV/SNDBUFFORCE and executes setup-nfqueue and flush-nfqueue
scripts which manipulate netfilter tables (don't run it on your
production machine with working firewall).  Also, it assumes that the
peer of the target TCP connection is on a remote machine and only
packets injected via raw socket pass through the loopback device.

Two instances of net-host keep talking to each other on 10.7.7.1 and
10.7.8.1 verifying received stream is sequence of incrementing
uint64_t's.  We want to hijack the socket from the net-host instance
on 10.7.8.1 and splice ourselves inside the connection so that the end
result looks like the following.

 net-host on 10.7.7.1 <---> parasite on 10.7.8.1 <---> net-host on 10.7.8.1

On 10.7.7.1,

  # ./net-host 9999 1024
A Connected to 10.7.8.2:40986
H signal 10 si_code=0
  inserting contaminant @0x22f682
G foreign data @0x2a7500 : 0xdeadbeefbeefdead

On 10.7.8.1,

  # ./net-host 10.7.7.1:9999 1024
A Connected to 10.7.7.1:9999
  BLOB: hello, world!
  PARASITE STARTED
B PARASITE SAY: ah ah! mic test!
H foreign data @0x22f682 : 0xdeadbeefbeefdead
G signal 10 si_code=0
  inserting contaminant @0x2a7500

On a different session on 10.7.8.1,

  # ls -l /proc/`pidof net-host`/fd
  total 0
  lrwx------ 1 root root 64 Aug  6 12:45 0 -> /dev/ttyS0
  lrwx------ 1 root root 64 Aug  6 12:45 1 -> /dev/ttyS0
  lrwx------ 1 root root 64 Aug  6 12:45 2 -> /dev/ttyS0
  lrwx------ 1 root root 64 Aug  6 12:45 3 -> socket:[12199]
A # ./parasite `pidof net-host` 3
  Seizing 1388
  Seizing 1389
  executing test blob
  blocking all signals = 0, prev_sigmask 0
  executing mmap blob = 0x7fa1e7caa000
  executing clone blob = 1397
  executing parasite
B waiting for connection... connected
  target socket: 10.7.8.2:40986 -> 10.7.7.1:9999 in 65392@0x185a3439 out 31856@0xb0e08180
C peeked socket buffer in 65392 out 26064
  executing munmap blob = 0
  restoring sigmask = 0, prev_sigmask 0xfffffffffffbfeef
D restoring connection, connecting...
  pkt: R->L S 185b33a9 A b0e02158 D 00000 a___ DROP
  pkt: R->L S 185b33a9 A b0e02158 D 00000 a___ DROP
  pkt: L->R S b0e02158 A 185b33a9 D 00000 a__r DROP
  pkt: L->R S b0e01baf A 00000000 D 00000 _s__ DROP DONE
  got SYN, replying with SYN/ACK
  pkt: R->L S 185a3438 A b0e01bb0 D 00000 as__ ACPT
  pkt: L->R S b0e01bb0 A 185a3439 D 00000 a___ DROP DONE
  connection established, repopulating rx/tx queues
E pkt: R->L S 185a3439 A b0e01bb0 D 01360 a___ ACPT
  pkt: L->R S b0e01bb0 A 185a3989 D 00000 a___ DROP DONE
  pkt: R->L S 185a3989 A b0e01bb0 D 01360 a___ ACPT
  pkt: L->R S b0e01bb0 A 185a3ed9 D 00000 a___ DROP DONE
  ...
  pkt: R->L S 185b3339 A b0e01bb0 D 00112 a___ ACPT
  pkt: L->R S b0e01bb0 A 185b33a9 D 00000 a___ DROP DONE
F snd: ---- S b0e01bb0 A -------- D 00000
  snd: ---- S b0e01bb0 A -------- D 01448
  ...
  snd: ---- S b0e07bd8 A -------- D 01448
G connection restored
  pkt: L->R S b0e01bb0 A 185b33a9 D 01448 a___ ACPT
  pkt: L->R S b0e02158 A 185b33a9 D 01448 a___ ACPT
  ...
  pkt: L->R S b0e04e98 A 185b33a9 D 01448 a___ ACPT
  pkt: R->L S 185b3629 A b0e02158 D 00000 a___ ACPT
  pkt: R->L S 185b3629 A b0e02700 D 00000 a___ ACPT
  ...
  pkt: R->L S 185b3629 A b0e05440 D 00000 a___ ACPT

On yet another session on 10.7.7.1,

H # killall -USR1 net-host

On yet another session on 10.7.8.1,

G # killall -USR1 net-host

<A> Two net-host instances are connected to each other sending and
    verifying each other's stream.  From /proc the socket fd is
    determined to be 3 and parasite is executed to hijack the socket.

<B> Upto this point, it's the same as the previous thread injection
    example.  Parasite thread is injected and test command is
    executed.

<C> Snapshot steps s3 - s5 are executed and the fd 3 is dupped over by
    a connection to the main program.  This means on 10.7.8.1, the
    connection doesn't exist anymore; however, 10.7.7.1 doesn't know
    this as packets belonging to the connection are being dropped.

<D> Restoration steps r1 - r3 are executed.

<E> rx queue is being repopulated.

<F> tx queue is being repopulated.

<G> Connection restored and the main parasite program now owns the
    original connection to net-host on 10.7.7.1 and a new connection
    to net-host on 10.7.8.1.  It pipes data between the two.

<H> Verify data is still flowing by triggering net-host on 10.7.7.1 to
    insert foreign data in the stream, which is soon received by
    net-host on 10.7.8.1.

<G> Vice-versa from 10.7.8.1 to 10.7.7.1.


NOTES

* As said above, the ioctls are only proof-of-concept.  We'll probably
  need more information exported and maybe a few more ways to
  manipulate the states.  As long as the state manipulations stay out
  of usual stream processing - ie. they only affect connection setup,
  I don't think the added complexity or maintenance overhead would be
  noticeable.

* Having socket inode # match in iptables would solve the packet
  matching problem.  Note that even on a busy system, this connection
  intervention shouldn't add much overhead.  While not snapshotting,
  no firewall rule is needed.  During snapshotting, only packets of
  the target connections are ipqueue'd and ipset can match many
  connections without much overhead.

* While writing, I had more things I wanted to talk about in this
  section but apparently forgot them. :( I'll add as they come back.

[-- Attachment #2: parasite-20110806.tar.gz --]
[-- Type: application/x-gzip, Size: 25006 bytes --]

[-- Attachment #3: tcp-ioctls.patch --]
[-- Type: text/plain, Size: 6650 bytes --]

 include/linux/sockios.h |    6 ++
 include/linux/tcp.h     |    2 
 net/ipv4/tcp.c          |  113 +++++++++++++++++++++++++++++++++++++++++++++++-
 net/ipv4/tcp_ipv4.c     |   16 ++++--
 4 files changed, 130 insertions(+), 7 deletions(-)

diff --git a/include/linux/sockios.h b/include/linux/sockios.h
index 7997a50..f5c3e41 100644
--- a/include/linux/sockios.h
+++ b/include/linux/sockios.h
@@ -127,6 +127,12 @@
 /* hardware time stamping: parameters in linux/net_tstamp.h */
 #define SIOCSHWTSTAMP   0x89b0
 
+#define SIOCGINSEQ	0x89b1		/* get copied_seq */
+#define SIOCGOUTSEQS	0x89b2		/* get seqs for pending tx pkts */
+#define SIOCSOUTSEQ	0x89b3		/* set write_seq */
+#define SIOCPEEKOUTQ	0x89b4		/* peek output queue */
+#define SIOCFORCEOUTBD	0x89b5		/* force output packet boundary */
+
 /* Device private ioctl calls */
 
 /*
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 531ede8..c0945fe 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -365,6 +365,8 @@ struct tcp_sock {
 	u32	snd_up;		/* Urgent pointer		*/
 
 	u8	keepalive_probes; /* num of allowed keep alive probes	*/
+	u8	wseq_set    : 1;/* Write sequence set via setsockopt	*/
+	u8	force_outbd : 1;/* force packet boundary on next send	*/
 /*
  *      Options received (usually on last packet, some only on SYN packets).
  */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 46febca..3389827 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -464,12 +464,118 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 }
 EXPORT_SYMBOL(tcp_poll);
 
+static int tcp_get_out_seqs(struct sock *sk, u32 __user *p, int size)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb;
+	int pos = 0, cnt = size / sizeof(u32);
+
+	if (pos < cnt && put_user(tp->write_seq, &p[pos++]))
+		return -EFAULT;
+
+	skb_queue_reverse_walk(&sk->sk_write_queue, skb) {
+		struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
+
+		if (pos < cnt && put_user(tcb->seq, &p[pos++]))
+			return -EFAULT;
+	}
+	return pos * sizeof(u32);
+}
+
+static int tcp_peek_outq(struct sock *sk, void __user *arg, int size)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct iovec iov = { .iov_base = arg, .iov_len = size };
+	struct sk_buff *skb;
+	int copied = 0, err = 0;
+	int outq, skip;
+
+	lock_sock(sk);
+
+	/* XXX: why doesn't SIOCOUTQ[NSD] account for queued fin? */
+	outq = tp->write_seq - tp->snd_una;
+	skb = skb_peek_tail(&sk->sk_write_queue);
+	if (outq && skb)
+		outq -= tcp_hdr(skb)->fin;
+
+	skip = outq - min(size, outq);
+
+	skb_queue_walk(&sk->sk_write_queue, skb) {
+		int off = 0, todo;
+
+		if (skip) {
+			off = min_t(int, skip, skb->len);
+			skip -= off;
+		}
+
+		if (!(todo = skb->len - off))
+			continue;
+
+		if (WARN_ON_ONCE(iov.iov_len < todo)) {
+			err = -EINVAL;
+			break;
+		}
+
+		err = skb_copy_datagram_iovec(skb, off, &iov, todo);
+		if (err)
+			break;
+		copied += todo;
+	}
+
+	release_sock(sk);
+
+	return err ?: copied;
+}
+
 int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int answ;
 
 	switch (cmd) {
+	case SIOCGOUTSEQS: {
+		s32 size;
+
+		if (get_user(size, (s32 __user *)arg))
+			return -EFAULT;
+		if (size < 0)
+			return -EINVAL;
+		return tcp_get_out_seqs(sk, (u32 __user *)arg, size);
+	}
+	case SIOCSOUTSEQ: {
+		u32 seq;
+
+		if (get_user(seq, (u32 __user *)arg))
+			return -EFAULT;
+
+		lock_sock(sk);
+		answ = -EISCONN;
+		if ((sk->sk_socket->state == SS_UNCONNECTED &&
+		     sk->sk_state == TCP_CLOSE) || sk->sk_state == TCP_LISTEN) {
+			tp->write_seq = seq;
+			tp->wseq_set = true;
+			answ = 0;
+		}
+		release_sock(sk);
+		return answ;
+	}
+	case SIOCPEEKOUTQ: {
+		u32 size;
+
+		if (get_user(size, (u32 __user *)arg))
+			return -EFAULT;
+		if ((int)size < size)
+			return -EINVAL;
+		return tcp_peek_outq(sk, (void __user *)arg, size);
+	}
+	case SIOCFORCEOUTBD:
+		lock_sock(sk);
+		tp->force_outbd = true;
+		release_sock(sk);
+		return 0;
+	}
+
+	switch (cmd) {
 	case SIOCINQ:
 		if (sk->sk_state == TCP_LISTEN)
 			return -EINVAL;
@@ -514,6 +620,9 @@ int tcp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 		else
 			answ = tp->write_seq - tp->snd_nxt;
 		break;
+	case SIOCGINSEQ:
+		answ = tp->copied_seq;
+		break;
 	default:
 		return -ENOIOCTLCMD;
 	}
@@ -965,7 +1074,7 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 				copy = max - skb->len;
 			}
 
-			if (copy <= 0) {
+			if (copy <= 0 || unlikely(tp->force_outbd)) {
 new_segment:
 				/* Allocate new segment. If the interface is SG,
 				 * allocate skb fitting to single page.
@@ -979,6 +1088,8 @@ new_segment:
 				if (!skb)
 					goto wait_for_memory;
 
+				tp->force_outbd = false;
+
 				/*
 				 * Check whether we can use HW checksum.
 				 */
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 955b8e6..579234c 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -201,7 +201,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		/* Reset inherited state */
 		tp->rx_opt.ts_recent	   = 0;
 		tp->rx_opt.ts_recent_stamp = 0;
-		tp->write_seq		   = 0;
+		if (!tp->wseq_set)
+			tp->write_seq      = 0;
 	}
 
 	if (tcp_death_row.sysctl_tw_recycle &&
@@ -252,12 +253,12 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	sk->sk_gso_type = SKB_GSO_TCPV4;
 	sk_setup_caps(sk, &rt->dst);
 
-	if (!tp->write_seq)
+	if (!tp->write_seq && !tp->wseq_set)
 		tp->write_seq = secure_tcp_sequence_number(inet->inet_saddr,
 							   inet->inet_daddr,
 							   inet->inet_sport,
 							   usin->sin_port);
-
+	tp->wseq_set = false;
 	inet->inet_id = tp->write_seq ^ jiffies;
 
 	err = tcp_connect(sk);
@@ -1252,7 +1253,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 		if (net_ratelimit())
 			syn_flood_warning(skb);
 #ifdef CONFIG_SYN_COOKIES
-		if (sysctl_tcp_syncookies) {
+		if (sysctl_tcp_syncookies && !tp->wseq_set) {
 			want_cookie = 1;
 		} else
 #endif
@@ -1334,7 +1335,10 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	if (!want_cookie || tmp_opt.tstamp_ok)
 		TCP_ECN_create_request(req, tcp_hdr(skb));
 
-	if (want_cookie) {
+	if (unlikely(tp->wseq_set)) {
+		isn = tp->write_seq;
+		tp->wseq_set = false;
+	} else if (want_cookie) {
 		isn = cookie_v4_init_sequence(sk, skb, &req->mss);
 		req->cookie_ts = tmp_opt.tstamp_ok;
 	} else if (!isn) {
@@ -1526,7 +1530,7 @@ static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
 	}
 
 #ifdef CONFIG_SYN_COOKIES
-	if (!th->syn)
+	if (!th->syn && !tcp_sk(sk)->wseq_set)
 		sk = cookie_v4_check(sk, skb, &(IPCB(skb)->opt));
 #endif
 	return sk;

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [EXAMPLE CODE] Parasite thread injection and TCP connection hijacking
  2011-08-06 12:12 [EXAMPLE CODE] Parasite thread injection and TCP connection hijacking Tejun Heo
@ 2011-08-06 12:45 ` Andy Lutomirski
  2011-08-06 13:00   ` Tejun Heo
  2011-10-30  4:48 ` hiberante hangs TCP " David Fries
  1 sibling, 1 reply; 12+ messages in thread
From: Andy Lutomirski @ 2011-08-06 12:45 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Matt Helsley, Pavel Emelyanov, Nathan Lynch, Oren Laadan,
	Daniel Lezcano, S, James E.J. Bottomley, David S. Miller,
	linux-kernel, netdev

On 08/06/2011 08:12 AM, Tejun Heo wrote:
> Hello, guys.
>
> So, here's transparent TCP connection hijacking (ie. checkpointing in
> one process and restoring in another) which adds only relatively small
> pieces to the kernel.  It's by no means complete but already works
> rather reliably in my test setup even with heavy delay induced with
> tc.
>
> I wrote a rather long README describing how it's working, what's
> missing which is appended at the end of this mail so if you're
> interested in the details please go ahead and read.

That's a little gross but quite cool.

I think you have an annoying corner case, though:

 > 2. Decide where to inject the foreign code and save the original code
 >    with PTRACE_PEEKDATA.  Tracer can poke any mapped area regardless
 >    of protection flags but it can't add execution permission to the
 >    code, so it needs to choose memory area which already has X flag
 >    set.  The example code uses the page the %rip is in.

If the process is executing from the vsyscall page, then you'll probably 
fail.  (Admittedly, this is rather unlikely, given that the vsyscalls 
are now exactly one instruction.)  Presumably you also fail if executing 
from a read-only MAP_SHARED mapping.

Windows has a facility to more-or-less call mmap on behalf of another 
process, and another one to directly inject a thread into a remote 
process.  It's traditional to use them for this type of manipulation. 
Perhaps Linux should get the same thing.  (Although you could accomplish 
much the same thing if you could create a task with your mm but the 
tracee's fs.)

--Andy

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [EXAMPLE CODE] Parasite thread injection and TCP connection hijacking
  2011-08-06 12:45 ` Andy Lutomirski
@ 2011-08-06 13:00   ` Tejun Heo
  2011-08-06 13:15     ` Andrew Lutomirski
  2011-08-08 10:20     ` Tejun Heo
  0 siblings, 2 replies; 12+ messages in thread
From: Tejun Heo @ 2011-08-06 13:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Matt Helsley, Pavel Emelyanov, Nathan Lynch, Oren Laadan,
	Daniel Lezcano, S, James E.J. Bottomley, David S. Miller,
	linux-kernel, netdev

Hello,

On Sat, Aug 06, 2011 at 08:45:28AM -0400, Andy Lutomirski wrote:
> > 2. Decide where to inject the foreign code and save the original code
> >    with PTRACE_PEEKDATA.  Tracer can poke any mapped area regardless
> >    of protection flags but it can't add execution permission to the
> >    code, so it needs to choose memory area which already has X flag
> >    set.  The example code uses the page the %rip is in.
> 
> If the process is executing from the vsyscall page, then you'll
> probably fail.  (Admittedly, this is rather unlikely, given that the
> vsyscalls are now exactly one instruction.)  Presumably you also
> fail if executing from a read-only MAP_SHARED mapping.

Heh, yeah, I originally thought about scanning /proc/PID/maps to look
for the page to use but was lazy and just used %rip.  I think that
should work.  I'll note the problem in README.

> Windows has a facility to more-or-less call mmap on behalf of
> another process, and another one to directly inject a thread into a
> remote process.  It's traditional to use them for this type of
> manipulation. Perhaps Linux should get the same thing.  (Although
> you could accomplish much the same thing if you could create a task
> with your mm but the tracee's fs.)

Actually, the only thing we need on x86_64 is two bytes for the
syscall instruction because all params are passed through registers
anyway.  We can just set up parameters for mmap, turn on single step,
point %rip to syscall in the vsyscall page.  So, either way, I don't
think this would be too difficult to solve.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [EXAMPLE CODE] Parasite thread injection and TCP connection hijacking
  2011-08-06 13:00   ` Tejun Heo
@ 2011-08-06 13:15     ` Andrew Lutomirski
  2011-08-06 13:20       ` Tejun Heo
  2011-08-08 10:20     ` Tejun Heo
  1 sibling, 1 reply; 12+ messages in thread
From: Andrew Lutomirski @ 2011-08-06 13:15 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Matt Helsley, Pavel Emelyanov, Nathan Lynch, Oren Laadan,
	Daniel Lezcano, S, James E.J. Bottomley, David S. Miller,
	linux-kernel, netdev

On Sat, Aug 6, 2011 at 9:00 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello,
>
> On Sat, Aug 06, 2011 at 08:45:28AM -0400, Andy Lutomirski wrote:
>> > 2. Decide where to inject the foreign code and save the original code
>> >    with PTRACE_PEEKDATA.  Tracer can poke any mapped area regardless
>> >    of protection flags but it can't add execution permission to the
>> >    code, so it needs to choose memory area which already has X flag
>> >    set.  The example code uses the page the %rip is in.
>>
>> If the process is executing from the vsyscall page, then you'll
>> probably fail.  (Admittedly, this is rather unlikely, given that the
>> vsyscalls are now exactly one instruction.)  Presumably you also
>> fail if executing from a read-only MAP_SHARED mapping.
>
> Heh, yeah, I originally thought about scanning /proc/PID/maps to look
> for the page to use but was lazy and just used %rip.  I think that
> should work.  I'll note the problem in README.
>
>> Windows has a facility to more-or-less call mmap on behalf of
>> another process, and another one to directly inject a thread into a
>> remote process.  It's traditional to use them for this type of
>> manipulation. Perhaps Linux should get the same thing.  (Although
>> you could accomplish much the same thing if you could create a task
>> with your mm but the tracee's fs.)
>
> Actually, the only thing we need on x86_64 is two bytes for the
> syscall instruction because all params are passed through registers
> anyway.  We can just set up parameters for mmap, turn on single step,
> point %rip to syscall in the vsyscall page.  So, either way, I don't
> think this would be too difficult to solve.

Not any more -- that syscall instruction is gone as of 3.1.  You could
search through the vdso to find a syscall, but that seems fragile.

Why not just add a ptrace command to issue a syscall?

--Andy

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [EXAMPLE CODE] Parasite thread injection and TCP connection hijacking
  2011-08-06 13:15     ` Andrew Lutomirski
@ 2011-08-06 13:20       ` Tejun Heo
  0 siblings, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2011-08-06 13:20 UTC (permalink / raw)
  To: Andrew Lutomirski
  Cc: Matt Helsley, Pavel Emelyanov, Nathan Lynch, Oren Laadan,
	Daniel Lezcano, S, James E.J. Bottomley, David S. Miller,
	linux-kernel, netdev

Hello,

On Sat, Aug 06, 2011 at 09:15:45AM -0400, Andrew Lutomirski wrote:
> On Sat, Aug 6, 2011 at 9:00 AM, Tejun Heo <tj@kernel.org> wrote:
> > Actually, the only thing we need on x86_64 is two bytes for the
> > syscall instruction because all params are passed through registers
> > anyway.  We can just set up parameters for mmap, turn on single step,
> > point %rip to syscall in the vsyscall page.  So, either way, I don't
> > think this would be too difficult to solve.
> 
> Not any more -- that syscall instruction is gone as of 3.1.  You could
> search through the vdso to find a syscall, but that seems fragile.
> 
> Why not just add a ptrace command to issue a syscall?

Yeah, maybe.  If this thing proves to be useful enough and looking for
a page to poke under proc too cumbersome.  I'm not against it but
don't really see strong need either at this point.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [EXAMPLE CODE] Parasite thread injection and TCP connection hijacking
  2011-08-06 13:00   ` Tejun Heo
  2011-08-06 13:15     ` Andrew Lutomirski
@ 2011-08-08 10:20     ` Tejun Heo
  1 sibling, 0 replies; 12+ messages in thread
From: Tejun Heo @ 2011-08-08 10:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Matt Helsley, Pavel Emelyanov, Nathan Lynch, Oren Laadan,
	Daniel Lezcano, James E.J. Bottomley, David S. Miller,
	linux-kernel, netdev

On Sat, Aug 06, 2011 at 03:00:37PM +0200, Tejun Heo wrote:
> Hello,
> 
> On Sat, Aug 06, 2011 at 08:45:28AM -0400, Andy Lutomirski wrote:
> > > 2. Decide where to inject the foreign code and save the original code
> > >    with PTRACE_PEEKDATA.  Tracer can poke any mapped area regardless
> > >    of protection flags but it can't add execution permission to the
> > >    code, so it needs to choose memory area which already has X flag
> > >    set.  The example code uses the page the %rip is in.
> > 
> > If the process is executing from the vsyscall page, then you'll
> > probably fail.  (Admittedly, this is rather unlikely, given that the
> > vsyscalls are now exactly one instruction.)  Presumably you also
> > fail if executing from a read-only MAP_SHARED mapping.
> 
> Heh, yeah, I originally thought about scanning /proc/PID/maps to look
> for the page to use but was lazy and just used %rip.  I think that
> should work.  I'll note the problem in README.

Okay, updated README.

  http://code.google.com/p/ptrace-parasite/source/browse/README

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* hiberante hangs TCP Re: [EXAMPLE CODE] Parasite thread injection and TCP connection hijacking
  2011-08-06 12:12 [EXAMPLE CODE] Parasite thread injection and TCP connection hijacking Tejun Heo
  2011-08-06 12:45 ` Andy Lutomirski
@ 2011-10-30  4:48 ` David Fries
  2011-10-30 20:16   ` Tejun Heo
  1 sibling, 1 reply; 12+ messages in thread
From: David Fries @ 2011-10-30  4:48 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, netdev

On Sat, Aug 06, 2011 at 02:12:47PM +0200, Tejun Heo wrote:
> Hello, guys.
> 
> So, here's transparent TCP connection hijacking (ie. checkpointing in
> one process and restoring in another) which adds only relatively small
> pieces to the kernel.  It's by no means complete but already works
> rather reliably in my test setup even with heavy delay induced with
> tc.

I saw the write up on this on lwn.net, pretty creative by the way, and
it got me thinking about a different checkpoint/restart problem I've
been running into.  Specifically in hibernating to disk.  In the
hibernate case active TCP connections hang after resuming, while an
idle TCP connection will continue after the system is back up.  My
observation is the kernel checkpoints itself to memory, enables
devices, writes out that checkpoint image to storage, then powers off.
The problem is if TCP packets are received while writing to storage,
the kernel will continue to queue and ack those TCP packets, but the
running kernel and it's network state is shortly lost.  When the
computer resumes, those TCP byte sequences hang the TCP connection for
an extended period of time while the resumed computer refuses to
acknowledge the data that was received after checkpointing and the now
running kernel knew nothing about, and the other computer tries in
vain to resend any data that hadn't yet been acknowledged, which is
always after the data that was lost, until one of them eventually
gives up.

I've been wondering if it was safe or possible to leave any network
interfaces down after the checkpoint, or what the right solution would
be.  I didn't think marking every TCP connection with a ZOMBIE_KERNEL
bit just after the kernel checkpoint (for the kernel is walking dead
and won't remember anything that happens), and then prevent any TCP
acks from being sent for those connections would be the right
solution.  I've taken to unplugging the physical lan cable,
hibernating to disk, and plugging it back in after the system is down,
to avoid the problem.  Any ideas?

-- 
David Fries <david@fries.net>    PGP pub CB1EE8F0
http://fries.net/~david/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: hiberante hangs TCP Re: [EXAMPLE CODE] Parasite thread injection and TCP connection hijacking
  2011-10-30  4:48 ` hiberante hangs TCP " David Fries
@ 2011-10-30 20:16   ` Tejun Heo
  2011-10-30 20:43     ` David Fries
  2011-11-02  9:44     ` MyungJoo Ham
  0 siblings, 2 replies; 12+ messages in thread
From: Tejun Heo @ 2011-10-30 20:16 UTC (permalink / raw)
  To: David Fries; +Cc: netdev, linux-pm, linux-kernel

(cc'ing Rafael and linux-pm)

On Sat, Oct 29, 2011 at 11:48:21PM -0500, David Fries wrote:
> I saw the write up on this on lwn.net, pretty creative by the way, and
> it got me thinking about a different checkpoint/restart problem I've
> been running into.  Specifically in hibernating to disk.  In the
> hibernate case active TCP connections hang after resuming, while an
> idle TCP connection will continue after the system is back up.  My
> observation is the kernel checkpoints itself to memory, enables
> devices, writes out that checkpoint image to storage, then powers off.
> The problem is if TCP packets are received while writing to storage,
> the kernel will continue to queue and ack those TCP packets, but the
> running kernel and it's network state is shortly lost.  When the
> computer resumes, those TCP byte sequences hang the TCP connection for
> an extended period of time while the resumed computer refuses to
> acknowledge the data that was received after checkpointing and the now
> running kernel knew nothing about, and the other computer tries in
> vain to resend any data that hadn't yet been acknowledged, which is
> always after the data that was lost, until one of them eventually
> gives up.
> 
> I've been wondering if it was safe or possible to leave any network
> interfaces down after the checkpoint, or what the right solution would
> be.  I didn't think marking every TCP connection with a ZOMBIE_KERNEL
> bit just after the kernel checkpoint (for the kernel is walking dead
> and won't remember anything that happens), and then prevent any TCP
> acks from being sent for those connections would be the right
> solution.  I've taken to unplugging the physical lan cable,
> hibernating to disk, and plugging it back in after the system is down,
> to avoid the problem.  Any ideas?

Hmmm... sounds like taking down network interfaces before starting
hibernation sequence should be enough, which shouldn't be too
difficult to implement from userland.  Rafael, what do you think?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: hiberante hangs TCP Re: [EXAMPLE CODE] Parasite thread injection and TCP connection hijacking
  2011-10-30 20:16   ` Tejun Heo
@ 2011-10-30 20:43     ` David Fries
  2011-11-02  9:44     ` MyungJoo Ham
  1 sibling, 0 replies; 12+ messages in thread
From: David Fries @ 2011-10-30 20:43 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-kernel, netdev, Rafael J. Wysocki, linux-pm

On Sun, Oct 30, 2011 at 01:16:18PM -0700, Tejun Heo wrote:
> (cc'ing Rafael and linux-pm)
> 
> On Sat, Oct 29, 2011 at 11:48:21PM -0500, David Fries wrote:
> > I saw the write up on this on lwn.net, pretty creative by the way, and
> > it got me thinking about a different checkpoint/restart problem I've
> > been running into.  Specifically in hibernating to disk.  In the
> > hibernate case active TCP connections hang after resuming, while an
> > idle TCP connection will continue after the system is back up.  My
> > observation is the kernel checkpoints itself to memory, enables
> > devices, writes out that checkpoint image to storage, then powers off.
> > The problem is if TCP packets are received while writing to storage,
> > the kernel will continue to queue and ack those TCP packets, but the
> > running kernel and it's network state is shortly lost.  When the
> > computer resumes, those TCP byte sequences hang the TCP connection for
> > an extended period of time while the resumed computer refuses to
> > acknowledge the data that was received after checkpointing and the now
> > running kernel knew nothing about, and the other computer tries in
> > vain to resend any data that hadn't yet been acknowledged, which is
> > always after the data that was lost, until one of them eventually
> > gives up.
> > 
> > I've been wondering if it was safe or possible to leave any network
> > interfaces down after the checkpoint, or what the right solution would
> > be.  I didn't think marking every TCP connection with a ZOMBIE_KERNEL
> > bit just after the kernel checkpoint (for the kernel is walking dead
> > and won't remember anything that happens), and then prevent any TCP
> > acks from being sent for those connections would be the right
> > solution.  I've taken to unplugging the physical lan cable,
> > hibernating to disk, and plugging it back in after the system is down,
> > to avoid the problem.  Any ideas?
> 
> Hmmm... sounds like taking down network interfaces before starting
> hibernation sequence should be enough, which shouldn't be too
> difficult to implement from userland.  Rafael, what do you think?

What I observe is the kernel prints out "Preallocating image memory",
then when the screen goes blank the network link light also goes out,
then the screen comes back on with "Compressing and saving" along with
the link light comes on, until it has been saved and the system shuts
down.  So the kernel is already brining the network down, it just
needs to keep it there until the original check pointed kernel is back
up.

Userspace bringing the network interfaces down is problematic.  As an
example one of my systems is running hostapd as an access point and
bridging that to the wired ethernet, that's not a trivial task to
setup and take down (the Debian ifup can set it up, but I've not
figured out yet how to get ifdown to take everything down cleanly, and
I sometimes manually run hostapd if I'm troubleshooting).  Any
manually added routes would go away, good luck in setting everything
back up the way it was before for all the different configurations out
there in userspace.  Add to those issues programs would now have a
time when networking is down that they wouldn't have otherwise seen.

-- 
David Fries <david@fries.net>    PGP pub CB1EE8F0
http://fries.net/~david/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: hiberante hangs TCP Re: [EXAMPLE CODE] Parasite thread injection and TCP connection hijacking
  2011-10-30 20:16   ` Tejun Heo
  2011-10-30 20:43     ` David Fries
@ 2011-11-02  9:44     ` MyungJoo Ham
  2011-11-02 15:10       ` Tejun Heo
  1 sibling, 1 reply; 12+ messages in thread
From: MyungJoo Ham @ 2011-11-02  9:44 UTC (permalink / raw)
  To: Tejun Heo; +Cc: netdev, linux-pm, David Fries, linux-kernel

On Mon, Oct 31, 2011 at 5:16 AM, Tejun Heo <tj@kernel.org> wrote:
> (cc'ing Rafael and linux-pm)
>
> On Sat, Oct 29, 2011 at 11:48:21PM -0500, David Fries wrote:
>> I saw the write up on this on lwn.net, pretty creative by the way, and
>> it got me thinking about a different checkpoint/restart problem I've
>> been running into.  Specifically in hibernating to disk.  In the
>> hibernate case active TCP connections hang after resuming, while an
>> idle TCP connection will continue after the system is back up.  My
>> observation is the kernel checkpoints itself to memory, enables
>> devices, writes out that checkpoint image to storage, then powers off.
>> The problem is if TCP packets are received while writing to storage,
>> the kernel will continue to queue and ack those TCP packets, but the
>> running kernel and it's network state is shortly lost.  When the
>> computer resumes, those TCP byte sequences hang the TCP connection for
>> an extended period of time while the resumed computer refuses to
>> acknowledge the data that was received after checkpointing and the now
>> running kernel knew nothing about, and the other computer tries in
>> vain to resend any data that hadn't yet been acknowledged, which is
>> always after the data that was lost, until one of them eventually
>> gives up.
>>
>> I've been wondering if it was safe or possible to leave any network
>> interfaces down after the checkpoint, or what the right solution would
>> be.  I didn't think marking every TCP connection with a ZOMBIE_KERNEL
>> bit just after the kernel checkpoint (for the kernel is walking dead
>> and won't remember anything that happens), and then prevent any TCP
>> acks from being sent for those connections would be the right
>> solution.  I've taken to unplugging the physical lan cable,
>> hibernating to disk, and plugging it back in after the system is down,
>> to avoid the problem.  Any ideas?
>
> Hmmm... sounds like taking down network interfaces before starting
> hibernation sequence should be enough, which shouldn't be too
> difficult to implement from userland.  Rafael, what do you think?
>
> Thanks.

Um... it seems that the "thaw" callbacks of network interfaces or TCP
should do something on this.

Probably, the "thaw" callbacks should make sure that the TCP
connections are closed?



Cheers,
MyungJoo


>
> --
> tejun
> _______________________________________________
> linux-pm mailing list
> linux-pm@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/linux-pm
>



-- 
MyungJoo Ham, Ph.D.
Mobile Software Platform Lab, DMC Business, Samsung Electronics

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: hiberante hangs TCP Re: [EXAMPLE CODE] Parasite thread injection and TCP connection hijacking
  2011-11-02  9:44     ` MyungJoo Ham
@ 2011-11-02 15:10       ` Tejun Heo
  2012-02-17 19:28         ` Pavel Machek
  0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2011-11-02 15:10 UTC (permalink / raw)
  To: MyungJoo Ham; +Cc: netdev, linux-pm, David Fries, linux-kernel

Hello,

On Wed, Nov 02, 2011 at 06:44:31PM +0900, MyungJoo Ham wrote:
> > Hmmm... sounds like taking down network interfaces before starting
> > hibernation sequence should be enough, which shouldn't be too
> > difficult to implement from userland.  Rafael, what do you think?
> >
> > Thanks.
> 
> Um... it seems that the "thaw" callbacks of network interfaces or TCP
> should do something on this.
> 
> Probably, the "thaw" callbacks should make sure that the TCP
> connections are closed?

I don't think it's a good idea to diddle with TCP connections from
that layer.  From what I understand, it seem all we need is plugging
tx/rx while preparing for hibernation.  That shouldn't be too
difficult.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: hiberante hangs TCP Re: [EXAMPLE CODE] Parasite thread injection and TCP connection hijacking
  2011-11-02 15:10       ` Tejun Heo
@ 2012-02-17 19:28         ` Pavel Machek
  0 siblings, 0 replies; 12+ messages in thread
From: Pavel Machek @ 2012-02-17 19:28 UTC (permalink / raw)
  To: Tejun Heo; +Cc: netdev, linux-pm, linux-kernel, David Fries

On Wed 2011-11-02 08:10:39, Tejun Heo wrote:
> Hello,
> 
> On Wed, Nov 02, 2011 at 06:44:31PM +0900, MyungJoo Ham wrote:
> > > Hmmm... sounds like taking down network interfaces before starting
> > > hibernation sequence should be enough, which shouldn't be too
> > > difficult to implement from userland.  Rafael, what do you think?
> > >
> > > Thanks.
> > 
> > Um... it seems that the "thaw" callbacks of network interfaces or TCP
> > should do something on this.
> > 
> > Probably, the "thaw" callbacks should make sure that the TCP
> > connections are closed?
> 
> I don't think it's a good idea to diddle with TCP connections from
> that layer.  From what I understand, it seem all we need is plugging
> tx/rx while preparing for hibernation.  That shouldn't be too
> difficult.

Yes, that should be done. 

If someone has uswsusp setup where they talk over the network, it
might break them, but hopefully noone is doing that.

Also hopefully noone does hibernation on /dev/nbd.

      	    							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2012-02-17 19:28 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-06 12:12 [EXAMPLE CODE] Parasite thread injection and TCP connection hijacking Tejun Heo
2011-08-06 12:45 ` Andy Lutomirski
2011-08-06 13:00   ` Tejun Heo
2011-08-06 13:15     ` Andrew Lutomirski
2011-08-06 13:20       ` Tejun Heo
2011-08-08 10:20     ` Tejun Heo
2011-10-30  4:48 ` hiberante hangs TCP " David Fries
2011-10-30 20:16   ` Tejun Heo
2011-10-30 20:43     ` David Fries
2011-11-02  9:44     ` MyungJoo Ham
2011-11-02 15:10       ` Tejun Heo
2012-02-17 19:28         ` Pavel Machek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).