I need a Linux TCP stack guru

I am looking for someone who knows the internals of the TCP implementation on Linux (2.6.10 or thereabouts). Here's a brief overview of the issue I'm trying to resolve:

Background: I'm trying to optimize transfers over a local GigE connection. The Linux machine (MIPS) is supposed to send 500K+ of data using a single send() function from the test application. The socket buffer size is set to more than 1MB. Nagle is disabled (not that it should matter in this case). I've essentially disabled congestion control by initializing tcp_cwnd to something like 128. I've done everything I can think of to make sure the kernel and/or TCP stack have no reason to do anything but send this chunk of TCP data as fast as possible.

Problem: Whenever the Linux TCP stack receives a packet from the peer indicating a larger window size, it seems to cause a delay of about 350 microseconds before additional TCP processing occurs on this connection. This occurs BEFORE the peer's window ever gets small enough to force the Linux machine to stop filling it, so it's not a case of the window closing and Linux having to stop sending data to the peer.

Analysis: Doing the math, this chunk should be able to be transferred in under 5 milliseconds (really, closer to 4 msec). Instead, it's taking around 20 msec. There are 41 of these window opening delay events in my test transfer, adding at least 15 msec to the transfer time.

I don't know if I've explained this as clearly as I'd like. I could really use a quick chat with someone who knows the workings of the Linux stack inside and out (especially with regard to congestion control and ACK/window processing).

Patrick
========= For LAN/WAN Protocol Analysis, check out PacketView Pro! =========
Patrick Klos    Email: snipped-for-privacy@klos.com
Klos Technologies, Inc.


Reply to
Patrick Klos

Are you being bitten by TCP's slow start feature here? TCP connections do a slow start just in case the connection crosses a congested link, so that they don't make the situation worse. After some period of good ACKs and good RTTs, TCP winds up to full throughput.

It's a known problem with TCP on very fast, uncongested networks, and it can restrict TCP throughput. It also hits apps where there are lots and lots of small TCP sessions (like the web :-().

Check out RFC 2001; Google returns loads of refs.

Patrick Klos wrote:

Reply to
Jim Jackson

Thanks for the reply. Although slow start may also be involved, I determined that the primary reason I was seeing such delays was interrupt coalescing. When I disabled interrupt coalescing on the ethernet adapter, my transfer times became consistently shorter.

I'll check that out. I'm still seeing symptoms that appear slow-start-like, but they don't happen all the time. Does Linux TCP "remember" congestion information on a per-interface basis rather than on a per-connection basis?

Patrick

Reply to
Patrick Klos

"remember"

Can't see how it could. It might cache connection info by destination, just in case there are multiple TCP sessions to the same end point - it sounds like it would be a neat optimisation - but sorry, I'm no Linux kernel TCP gearhead, so dunno. What kernel version are you using?

Reply to
Jim Jackson

It's kept in the metrics portion of the routing cache. It's based on broader route selection criteria, not the interface. Stored metrics include things like RTT, cwnd, initial cwnd, send threshold, PMTU, negotiated MSS, etc. TCP also has per-connection state, of course. Storing metrics in the routing tables seems pretty common; I know of several other TCP implementations that do the same (e.g. Sun Solaris, at least as of a few years ago). It's the obvious way of doing it, since the route picked greatly affects network behavior, and two connections to the same address can end up with different routes, so they may need different metrics.

Reply to
Jan Brittenson
