Opened 4 years ago

Closed 4 years ago

Last modified 3 years ago

#3264 closed enhancement (fixed)

[PATCH] Better timeout handling

Reported by: elexis Owned by: elexis
Priority: Should Have Milestone: Alpha 20
Component: Network Keywords: patch
Cc: Patch:

Description (last modified by elexis)

Refs #69 Enhance Multiplayer Experience, #2854 Extensive lag may cause disconnections

Problem: When a client drops, the game hangs for up to 30 seconds sometimes. The game appears to be paused and no one knows what's going on (although you can be sure that someone is dropping after like 10 seconds).

Fix: (1) Some seconds after the last turn was processed, a message should be displayed for all clients that we are waiting for. (2) The maximum timeout duration could be reduced from 30 to 10-20 seconds.


Reproduce timeout pauses: Without modifying 0ad, I could reproduce a 5 to 8 second timeout, where the game appears to be paused for that duration. You need to start a game with another client and then kill that client with

kill -9 <pid>

To get the correct process ID use:

pidof pyrogenesis

Or just use this command to kill the 0ad instance with the highest PID (usually the second instance):

kill -9 `pidof pyrogenesis | tr ' ' '\n' | sort -r | head -n 1`

Following the timeout model of enet (see below), the 30 second timeout can only be reproduced if you have actual lag.


Technical Background: Currently the client network timeouts are not set in 0ad, i.e. the default values of the enet library are used. They are hardcoded in enet.h:

   ENET_PEER_TIMEOUT_LIMIT                = 32,
   ENET_PEER_TIMEOUT_MINIMUM              = 5000,
   ENET_PEER_TIMEOUT_MAXIMUM              = 30000,

The documentation doesn't have too much information. The code can be obtained from http://enet.bespin.org/Downloads.html.

In peer.c you can find the following method to change the timeout settings for each peer:

/** Sets the timeout parameters for a peer.

    The timeout parameter control how and when a peer will timeout from a failure to acknowledge
    reliable traffic. Timeout values use an exponential backoff mechanism, where if a reliable
    packet is not acknowledge within some multiple of the average RTT plus a variance tolerance, 
    the timeout will be doubled until it reaches a set limit. If the timeout is thus at this
    limit and reliable packets have been sent but not acknowledged within a certain minimum time 
    period, the peer will be disconnected. Alternatively, if reliable packets have been sent
    but not acknowledged for a certain maximum time period, the peer will be disconnected regardless
    of the current timeout limit value.
    
    @param peer the peer to adjust
    @param timeoutLimit the timeout limit; defaults to ENET_PEER_TIMEOUT_LIMIT if 0
    @param timeoutMinimum the timeout minimum; defaults to ENET_PEER_TIMEOUT_MINIMUM if 0
    @param timeoutMaximum the timeout maximum; defaults to ENET_PEER_TIMEOUT_MAXIMUM if 0
*/

Therefore timeouts can be changed in NetServer.cppin the function CNetServerWorker::RunStep, for the case ENET_EVENT_TYPE_CONNECT by adding for example:

enet_peer_timeout(event.peer, 0, 0, 10000);

The timeout logic can be found in protocol.c.

If I understand the matter correctly, the client will be disconnected after 5 seconds if he had good latency before disconnecting. If he already had bad latency then enet will wait up to 30 seconds.

Attachments (5)

t3264_show_network_warnings_WIP_v1.patch (26.5 KB) - added by elexis 4 years ago.
Shows a warning in the top right corner of the screen in gamesetup and session if a client is losing the connection or having a bad ping (< turn length).
screenshot_ping.jpg (404.6 KB) - added by elexis 4 years ago.
screenshot_timeout.jpg (203.6 KB) - added by elexis 4 years ago.
video_ping.gif (423.7 KB) - added by elexis 4 years ago.
t3264_show_network_warnings_v1.patch (34.0 KB) - added by elexis 4 years ago.

Download all attachments as: .zip

Change History (16)

comment:1 Changed 4 years ago by elexis

Description: modified (diff)

comment:2 Changed 4 years ago by elexis

Actually it would be bad if we reduced the timeout duration. It would only result in more disconnects, since the performance is not optimized currently. If people are informed if the game is paused (#1950) or waiting for someone to timeout, then the issue should be resolved.

That line should still be added as a comment, so that it is easier to adapt the timeout to the performance in future revisions:

// enet_peer_timeout(event.peer, 0, 0, 10000);

Also the game should display a different notification if a player quits the game on purpose, so that you can distinguish it from a network disconnect.

comment:3 Changed 4 years ago by elexis

The hosting player should never be able to timeout.

If for some reason you had a timeout (i.e. lag for more than 30 seconds) it would kill the whole server.

Some reasons for such a huge timeout include sunken ship lag (r16636) or having 0ad on an external harddrive that goes to standby mode after starting the game and then needing more than 30 seconds to load later.

Moved to #3423

Last edited 4 years ago by elexis (previous) (diff)

comment:4 Changed 4 years ago by elexis

In 17710:

Show a more specific disconnect-message in case the host closed the server. Refs #3264, #3570.

Changed 4 years ago by elexis

Shows a warning in the top right corner of the screen in gamesetup and session if a client is losing the connection or having a bad ping (< turn length).

Changed 4 years ago by elexis

Attachment: screenshot_ping.jpg added

Changed 4 years ago by elexis

Attachment: screenshot_timeout.jpg added

Changed 4 years ago by elexis

Attachment: video_ping.gif added

comment:5 Changed 4 years ago by elexis

Reproduce bad latency and timeouts

With this bash script which uses traffic control it is possible to simulate arbitrary lag.

function setlag()
{
	for device in "eth0"; do
		sudo tc qdisc del dev $device root netem;
		if [ $1 -ne 0 ]; then
			sudo tc qdisc add dev $device root netem delay ${1}ms;
		fi
	done
}

The attached patch shows a warning in the top right corner of the screen in case the latency is worse than 110% of a turnlength (500ms) since in that case the game will be lagging notably.

Another warning is shown in case a client loses the connection (after 4 seconds). While the client is timing out, the game will not progress (for up to 30 seconds) before the client is disconnected.

These message give the host the possibility to kick the lagger or start a new game without that client.

Notice:

  • If the client detects bad latency / disconnect, it will trigger a local warning.
  • If the server detects bad latency / disconnect, it will send a packet to inform all other clients.

The packet CClientPerformanceMessage can be reused for a future feature to show the latencies of all clients.

Preview: http://trac.wildfiregames.com/raw-attachment/ticket/3264/video_ping.gif http://trac.wildfiregames.com/raw-attachment/ticket/3264/screenshot_ping.jpg http://trac.wildfiregames.com/raw-attachment/ticket/3264/screenshot_timeout.jpg

comment:6 Changed 4 years ago by elexis

Keywords: patch review added
Milestone: BacklogAlpha 20
Summary: Better timeout handling[PATCH] Better timeout handling

Updates in the patch:

Makes netwarnings optional, but enabled by default. Correctly sizes the textbox by exposing CalculateStringSize to the ScriptFunctions. Checks for bad ping in gamesetup by assuming the turnlength to be constant. Uses u32 for the RTT in the client-performance packet, since the client of the host should be able to freeze for more than 65 seconds without being disconnected (see #2854). Don't throw errors when receiving the new packet types while loading or synchronizing the game. std::time_t instead of u32 for std::time. Early return instead of ternary operator. Use extern instead of defining constants in header files. Don't report while the loading screen is visible. Don't process warnings if that client has been disconnected already (case "Unknown Player").

Changed 4 years ago by elexis

comment:7 Changed 4 years ago by elexis

After the netwarnings, three more message types can be added to explain all known distinct lagtypes:

The notification area can be reused to show other warnings that impact on the simulation rate. To explain to the user why the game is not advancing, those three message types should be added too:

  • Finish-rejoin lag: #3700 A network warning should be shown if the client needs more than 4 seconds to switch from NCS_JOIN_SYNCING to NCS_INGAME.
  • Pausing #1950: Pausing is currently indistinguishable from lag. This notification area would in my opinion be a good place to inform the user who paused.
  • Performance warnings:
    • The option should become a dropdown option with 3 possible values (no warnings, network warnings, all warnings).
    • If all warnings are enabled, it should show a message if a client needs more than turn_length to compute the last turn.
    • This will be quite spammy since the engine can't run 1000 units on a map flawlessly. The default setting should therefore only show network warnings.

(Therefore it should be renamed from netwarnings to gamestate-notifications)

comment:8 Changed 4 years ago by elexis

Owner: set to elexis
Resolution: fixed
Status: newclosed

In 17730:

Implement network-warnings, fixes #3264.
Shows a notification if the local client or other players connections timeout or have bad latency.

comment:9 Changed 4 years ago by elexis

Keywords: review removed

comment:10 Changed 4 years ago by elexis

In 18117:

Network cleanup.

Only send the network-warnings to clients that successfully joined the gamesetup or game.
Thus save a bit performance and
prevents a rare, harmless FSM update error (like in refs #3199) in case a client received the message while authenticating.
Reported by fatherbushido, refs #3264.

comment:11 Changed 3 years ago by elexis

Component: Core engineNetwork

(changed component to network)

Note: See TracTickets for help on using tickets.