Opened 8 years ago

Last modified 3 years ago

#3700 closed defect

Running the NetClient in a separate thread — at Version 6

Reported by: elexis Owned by:
Priority: Must Have Milestone: Alpha 24
Component: Network Keywords: beta
Cc: andy011973@… Patch:

Description (last modified by elexis)

Since the NetClient is running in the same thread as the map generation, the loading screen can entirely block the client for a couple dozen of seconds, thus the client appears to be offline ("losing the connection") while joining or rejoining the game. This can cause the disconnect of the client while loading the game (some players can't rejoin anymore).

Some of the map generator calls are threaded, but it were more sustainable if the client is executed in a separate thread, avoiding repetition of this problem.


Original report (only one symptom of the underlying problem): Since a chat-message is displayed when a client has finished rejoining (#1949), one can easily observe a reoccuring, distinct type of lag ("finished-rejoin lag").

It happens whenever a game has been played for a while (much serialized data). The game appears to be frozen for 5 to 15 seconds.

Here is what happens:

  1. The client is starting to rejoin.
  2. The server sends the client a lot of serialized data.
  3. The client needs a significant time to deserialize it.
  4. Meanwhile the game has continued (i.e. new turns)
  5. In order to finish the rejoin, the client simulates the new turns while all other clients wait for him. This is the finish-rejoin lag.

With #3242 rejoins happen more often, so the problem accumulates more in that situation.

In order to solve or reduce the impact of the issue, the playing clients could progress one turn for every five turns that the rejoined client computes. Hence the rejoined client still catches up with the other clients, while those are still able to play.

Change History (6)

comment:1 by elexis, 8 years ago

The same issue is stated in NetServer.cpp of r10437 (in particular the part about clients not being happy):

bool CNetServerWorker::OnJoinSyncingLoadedGame(void* context, CFsmEvent* event)
{
	// A client rejoining an in-progress game has now finished loading the
	// map and deserialized the initial state.
	// The simulation may have progressed since then, so send any subsequent
	// commands to them and set them as an active player so they can participate
	// in all future turns.
	//
	// (TODO: if it takes a long time for them to receive and execute all these
	// commands, the other players will get frozen for that time and may be unhappy;
	// we could try repeating this process a few times until the client converges
	// on the up-to-date state, before setting them as active.)

comment:2 by elexis, 8 years ago

Component: Core engineNetwork

(set component to network)

comment:3 by elexis, 7 years ago

Priority: If Time PermitsMust Have

The problem is more severe then lag. Some players can't rejoin a game anymore and become disconnected due to extensive lag after the loading screen!

It's not necessarily the syncing after the map load, but also parts of the map loading process itself. It can be reproduced by starting a huge map in multiplayer mode and adding a LOGERROR("NetClientSession Poll %d", std::time(nullptr)); to CNetClientSession::Poll():

ERROR: NetClientSession Poll 1483494187
ERROR: NetClientSession Poll 1483494187
ERROR: NetClientSession Poll 1483494188
TIMER| ParseTerrain: 37.8426 ms
ERROR: NetClientSession Poll 1483494188
ERROR: NetClientSession Poll 1483494188
TIMER| ParseEntities: 17.3228 s
ERROR: NetClientSession Poll 1483494206
ERROR: NetClientSession Poll 1483494206
ERROR: NetClientSession Poll 1483494206
ERROR: NetClientSession Poll 1483494206
TIMER| common/modern/setup.xml: 140.615 us
TIMER| common/modern/styles.xml: 148.302 us
TIMER| common/modern/sprites.xml: 1.91044 ms
TIMER| common/setup.xml: 261.822 us
TIMER| common/setup_resources.xml: 77.32 us
TIMER| common/sprites.xml: 523.206 us
TIMER| common/styles.xml: 115.968 us
TIMER| session/sprites.xml: 4.15153 ms
TIMER| session/styles.xml: 195.216 us
TIMER| session/session.xml: 111.865 ms
TIMER| common/global.xml: 6.56328 ms
GAME STARTED, ALL INIT COMPLETE
ERROR: NetClientSession Poll 1483494220
ERROR: NetClientSession Poll 1483494222
TIMER| LoadDLL: 588.092 us
ERROR: NetClientSession Poll 1483494231
ERROR: NetClientSession Poll 1483494231

As we can see the ParseEntities part of the random map loading code already mutes the NetClient for 17 seconds, which can be sufficient to disconnect it. Two more undocumented tasks blocked the client for 14 seconds and 9 seconds respectively.

These tasks are apparently not being run in a separate thread like GenerateMap (and the entire server) do.

comment:4 by elexis, 7 years ago

Keywords: beta added
Milestone: BacklogAlpha 22

comment:5 by elexis, 7 years ago

TODO: We should likely thread the entire client instead of identifying the individual bottlenecks. Would be more sustainable as the design pattern avoids repetition of such bugs.

comment:6 by elexis, 7 years ago

Description: modified (diff)
Summary: Finish-rejoin lagRunning the NetClient in a separate thread
Note: See TracTickets for help on using tickets.