Opened 8 years ago

Last modified 3 years ago

#3700 closed defect

[PATCH] Running the NetClient in a separate thread — at Version 17

Reported by: elexis Owned by: Andy Alt
Priority: Must Have Milestone: Alpha 24
Component: Network Keywords: beta
Cc: andy011973@… Patch:

Description (last modified by elexis)

Since the NetClient is running in the same thread as the map generation, the loading screen can entirely block the client for a couple dozen of seconds, thus the client appears to be offline ("losing the connection") while joining or rejoining the game. This can cause the disconnect of the client while loading the game (some players can't rejoin anymore).

Some of the map generator calls are threaded, but it were more sustainable if the client is executed in a separate thread, avoiding repetition of this problem.


Original report (only one symptom of the underlying problem): Since a chat-message is displayed when a client has finished rejoining (#1949), one can easily observe a reoccuring, distinct type of lag ("finished-rejoin lag").

It happens whenever a game has been played for a while (much serialized data). The game appears to be frozen for 5 to 15 seconds.

Here is what happens:

  1. The client is starting to rejoin.
  2. The server sends the client a lot of serialized data.
  3. The client needs a significant time to deserialize it.
  4. Meanwhile the game has continued (i.e. new turns)
  5. In order to finish the rejoin, the client simulates the new turns while all other clients wait for him. This is the finish-rejoin lag.

With #3242 rejoins happen more often, so the problem accumulates more in that situation.

In order to solve or reduce the impact of the issue, the playing clients could progress one turn for every five turns that the rejoined client computes. Hence the rejoined client still catches up with the other clients, while those are still able to play.

Change History (25)

comment:1 by elexis, 8 years ago

The same issue is stated in NetServer.cpp of r10437 (in particular the part about clients not being happy):

bool CNetServerWorker::OnJoinSyncingLoadedGame(void* context, CFsmEvent* event)
{
	// A client rejoining an in-progress game has now finished loading the
	// map and deserialized the initial state.
	// The simulation may have progressed since then, so send any subsequent
	// commands to them and set them as an active player so they can participate
	// in all future turns.
	//
	// (TODO: if it takes a long time for them to receive and execute all these
	// commands, the other players will get frozen for that time and may be unhappy;
	// we could try repeating this process a few times until the client converges
	// on the up-to-date state, before setting them as active.)

comment:2 by elexis, 8 years ago

Component: Core engineNetwork

(set component to network)

comment:3 by elexis, 7 years ago

Priority: If Time PermitsMust Have

The problem is more severe then lag. Some players can't rejoin a game anymore and become disconnected due to extensive lag after the loading screen!

It's not necessarily the syncing after the map load, but also parts of the map loading process itself. It can be reproduced by starting a huge map in multiplayer mode and adding a LOGERROR("NetClientSession Poll %d", std::time(nullptr)); to CNetClientSession::Poll():

ERROR: NetClientSession Poll 1483494187
ERROR: NetClientSession Poll 1483494187
ERROR: NetClientSession Poll 1483494188
TIMER| ParseTerrain: 37.8426 ms
ERROR: NetClientSession Poll 1483494188
ERROR: NetClientSession Poll 1483494188
TIMER| ParseEntities: 17.3228 s
ERROR: NetClientSession Poll 1483494206
ERROR: NetClientSession Poll 1483494206
ERROR: NetClientSession Poll 1483494206
ERROR: NetClientSession Poll 1483494206
TIMER| common/modern/setup.xml: 140.615 us
TIMER| common/modern/styles.xml: 148.302 us
TIMER| common/modern/sprites.xml: 1.91044 ms
TIMER| common/setup.xml: 261.822 us
TIMER| common/setup_resources.xml: 77.32 us
TIMER| common/sprites.xml: 523.206 us
TIMER| common/styles.xml: 115.968 us
TIMER| session/sprites.xml: 4.15153 ms
TIMER| session/styles.xml: 195.216 us
TIMER| session/session.xml: 111.865 ms
TIMER| common/global.xml: 6.56328 ms
GAME STARTED, ALL INIT COMPLETE
ERROR: NetClientSession Poll 1483494220
ERROR: NetClientSession Poll 1483494222
TIMER| LoadDLL: 588.092 us
ERROR: NetClientSession Poll 1483494231
ERROR: NetClientSession Poll 1483494231

As we can see the ParseEntities part of the random map loading code already mutes the NetClient for 17 seconds, which can be sufficient to disconnect it. Two more undocumented tasks blocked the client for 14 seconds and 9 seconds respectively.

These tasks are apparently not being run in a separate thread like GenerateMap (and the entire server) do.

comment:4 by elexis, 7 years ago

Keywords: beta added
Milestone: BacklogAlpha 22

comment:5 by elexis, 7 years ago

TODO: We should likely thread the entire client instead of identifying the individual bottlenecks. Would be more sustainable as the design pattern avoids repetition of such bugs.

comment:6 by elexis, 7 years ago

Description: modified (diff)
Summary: Finish-rejoin lagRunning the NetClient in a separate thread

comment:7 by elexis, 7 years ago

Tale of a great reproduce:

  • _zoro_ had a game running today where the bug was entirely reproducible. Often certain clients have "too good" latency to the host, so that they in particular can't rejoin. But this was the only time where I saw the host having this issue with every client.
  • 2 clients got disconnected right away after the game started and the loading screen disappeared:
    • Hannibal Barca (with 250ms ping to the host) and
    • fpre (with 25ms ping). They both could not rejoin the game anymore without many retries.
  • Many many late observers followed failing to finish the rejoin.
  • Seeing the opportunity, I rejoined that game then about 12 times (also got banned, restarted the router and retried undercover).
    • When I had 50ms ping to the host, the rejoin failed each time!
    • When I added 500ms lag using comment:5:ticket:3264, the rejoin succeeded each time!

Eureka, the worse the latency, the better the connection!

comment:8 by elexis, 7 years ago

Priority: Must HaveRelease Blocker

TODO: We actually must thread the affected map loading code too, since the GUI is not rendered for the same time (even if the client was threaded) and the window appears to be frozen for that period of the loading screen.

Flagging this as a release blocker, since clients have very often timed out when starting the game initially or rejoining. We can release without a fix to it if we can't make it, but we should really attempt to solve.

comment:9 by Andy Alt, 7 years ago

Cc: andy011973@… added

comment:10 by Andy Alt, 7 years ago

Owner: set to Andy Alt
Status: newassigned

by Andy Alt, 7 years ago

Attachment: 3700.txt added

only some renaming done. The threading hasn't been added yet. Builds but fails the tests.

comment:11 by Andy Alt, 7 years ago

A list of files where I also needed to change some references to the CNetClient class:

source/network/NetSession.cpp & .h source/network/NetClientTurnManager.cpp and .h source/ps/GameSetup/GameSetup.cpp source/gui/scripting/ScriptFunctions.cpp

In the first attachment, I only renamed the classes and functions; the threading routines haven't been added yet. It builds but fails the test suite.

by Andy Alt, 7 years ago

Attachment: 3700.2.txt added

does not build yet

comment:12 by Andy Alt, 7 years ago

I created the CNetClient Class and moved some functions into it. It doesn't build though.

Build output @ 3700.2.txt (above attachment)

I don't see much point in me trying to fix these errors yet, because I actually am very unsure which functions to put where. Perhaps I'm getting closer though. Anyone want to help me further with this or is it best left to the professionals?

Last edited 7 years ago by Andy Alt (previous) (diff)

comment:13 by Andy Alt, 7 years ago

Keywords: rfc added

by Andy Alt, 7 years ago

Attachment: 3700.patch added

use patch extension, Andy

comment:14 by Andy Alt, 7 years ago

Keywords: rfc removed
Summary: Running the NetClient in a separate thread[PATCH] Running the NetClient in a separate thread

by Andy Alt, 7 years ago

Attachment: 3700.2.patch added

A lot has been converted, resembles the NetServer files much more closely; requires much fine-tuning yet. I think we're on the right track.

by Andy Alt, 7 years ago

Attachment: 3700.3.patch added

minor changes from the previous version, for the purpose of clarity as elexis and I discuss it further on IRC

by Andy Alt, 7 years ago

Attachment: 3700.4.patch added

A few changes after discussion with elexis. WIP

by Andy Alt, 7 years ago

Attachment: 3700.5.patch added

builds but doesn't pass the tests. Thread-related functions will need a lot of work. when I copied them from NetServer.cpp, I either deleted or commented out most of the definitions. Some clean-up of the header file will probably be needed, some prototypes may be duplicated or in the wrong place

comment:15 by Andy Alt, 7 years ago

I removed the trace function; elexis said it wasn't used anymore, and it doesn't appear in NetServer.xxx so I decided to nuke it.

by Andy Alt, 7 years ago

Attachment: 3700.6.patch added

compiles, passes tests, still WIP

comment:17 by elexis, 7 years ago

Description: modified (diff)
Milestone: Alpha 22Alpha 23

Hi andy!

I have to bring the bad news that we probably don't have the manpower * time to fix this for this release :-/ We will try to avoid NetClient changes until then as far as possible, so that the current patch won't become broken.

Note: See TracTickets for help on using tickets.