Opened 5 years ago

Last modified 3 months ago

#3700 new defect

[PATCH] Running the NetClient in a separate thread

Reported by: elexis Owned by: elexis
Priority: Must Have Milestone: Alpha 24
Component: Network Keywords: beta
Cc: andy011973@… Patch: Phab:D2848

Description (last modified by elexis)

Since the NetClient is running in the same thread as the map generation, the loading screen can entirely block the client for a couple dozen of seconds, thus the client appears to be offline ("losing the connection") while joining or rejoining the game. This can cause the disconnect of the client while loading the game (some players can't rejoin anymore).

Some of the map generator calls are threaded, but it were more sustainable if the client is executed in a separate thread, avoiding repetition of this problem.


Original report (only one symptom of the underlying problem): Since a chat-message is displayed when a client has finished rejoining (#1949), one can easily observe a reoccuring, distinct type of lag ("finished-rejoin lag").

It happens whenever a game has been played for a while (much serialized data). The game appears to be frozen for 5 to 15 seconds.

Here is what happens:

  1. The client is starting to rejoin.
  2. The server sends the client a lot of serialized data.
  3. The client needs a significant time to deserialize it.
  4. Meanwhile the game has continued (i.e. new turns)
  5. In order to finish the rejoin, the client simulates the new turns while all other clients wait for him. This is the finish-rejoin lag.

With #3242 rejoins happen more often, so the problem accumulates more in that situation.

In order to solve or reduce the impact of the issue, the playing clients could progress one turn for every five turns that the rejoined client computes. Hence the rejoined client still catches up with the other clients, while those are still able to play.

Attachments (8)

3700.txt (21.3 KB ) - added by Andy Alt 4 years ago.
only some renaming done. The threading hasn't been added yet. Builds but fails the tests.
3700.2.txt (22.7 KB ) - added by Andy Alt 4 years ago.
does not build yet
3700.patch (22.7 KB ) - added by Andy Alt 4 years ago.
use patch extension, Andy
3700.2.patch (26.4 KB ) - added by Andy Alt 4 years ago.
A lot has been converted, resembles the NetServer files much more closely; requires much fine-tuning yet. I think we're on the right track.
3700.3.patch (26.4 KB ) - added by Andy Alt 4 years ago.
minor changes from the previous version, for the purpose of clarity as elexis and I discuss it further on IRC
3700.4.patch (25.9 KB ) - added by Andy Alt 4 years ago.
A few changes after discussion with elexis. WIP
3700.5.patch (26.5 KB ) - added by Andy Alt 4 years ago.
builds but doesn't pass the tests. Thread-related functions will need a lot of work. when I copied them from NetServer.cpp, I either deleted or commented out most of the definitions. Some clean-up of the header file will probably be needed, some prototypes may be duplicated or in the wrong place
3700.6.patch (27.0 KB ) - added by Andy Alt 4 years ago.
compiles, passes tests, still WIP

Download all attachments as: .zip

Change History (38)

comment:1 by elexis, 5 years ago

The same issue is stated in NetServer.cpp of r10437 (in particular the part about clients not being happy):

bool CNetServerWorker::OnJoinSyncingLoadedGame(void* context, CFsmEvent* event)
{
	// A client rejoining an in-progress game has now finished loading the
	// map and deserialized the initial state.
	// The simulation may have progressed since then, so send any subsequent
	// commands to them and set them as an active player so they can participate
	// in all future turns.
	//
	// (TODO: if it takes a long time for them to receive and execute all these
	// commands, the other players will get frozen for that time and may be unhappy;
	// we could try repeating this process a few times until the client converges
	// on the up-to-date state, before setting them as active.)

comment:2 by elexis, 4 years ago

Component: Core engineNetwork

(set component to network)

comment:3 by elexis, 4 years ago

Priority: If Time PermitsMust Have

The problem is more severe then lag. Some players can't rejoin a game anymore and become disconnected due to extensive lag after the loading screen!

It's not necessarily the syncing after the map load, but also parts of the map loading process itself. It can be reproduced by starting a huge map in multiplayer mode and adding a LOGERROR("NetClientSession Poll %d", std::time(nullptr)); to CNetClientSession::Poll():

ERROR: NetClientSession Poll 1483494187
ERROR: NetClientSession Poll 1483494187
ERROR: NetClientSession Poll 1483494188
TIMER| ParseTerrain: 37.8426 ms
ERROR: NetClientSession Poll 1483494188
ERROR: NetClientSession Poll 1483494188
TIMER| ParseEntities: 17.3228 s
ERROR: NetClientSession Poll 1483494206
ERROR: NetClientSession Poll 1483494206
ERROR: NetClientSession Poll 1483494206
ERROR: NetClientSession Poll 1483494206
TIMER| common/modern/setup.xml: 140.615 us
TIMER| common/modern/styles.xml: 148.302 us
TIMER| common/modern/sprites.xml: 1.91044 ms
TIMER| common/setup.xml: 261.822 us
TIMER| common/setup_resources.xml: 77.32 us
TIMER| common/sprites.xml: 523.206 us
TIMER| common/styles.xml: 115.968 us
TIMER| session/sprites.xml: 4.15153 ms
TIMER| session/styles.xml: 195.216 us
TIMER| session/session.xml: 111.865 ms
TIMER| common/global.xml: 6.56328 ms
GAME STARTED, ALL INIT COMPLETE
ERROR: NetClientSession Poll 1483494220
ERROR: NetClientSession Poll 1483494222
TIMER| LoadDLL: 588.092 us
ERROR: NetClientSession Poll 1483494231
ERROR: NetClientSession Poll 1483494231

As we can see the ParseEntities part of the random map loading code already mutes the NetClient for 17 seconds, which can be sufficient to disconnect it. Two more undocumented tasks blocked the client for 14 seconds and 9 seconds respectively.

These tasks are apparently not being run in a separate thread like GenerateMap (and the entire server) do.

comment:4 by elexis, 4 years ago

Keywords: beta added
Milestone: BacklogAlpha 22

comment:5 by elexis, 4 years ago

TODO: We should likely thread the entire client instead of identifying the individual bottlenecks. Would be more sustainable as the design pattern avoids repetition of such bugs.

comment:6 by elexis, 4 years ago

Description: modified (diff)
Summary: Finish-rejoin lagRunning the NetClient in a separate thread

comment:7 by elexis, 4 years ago

Tale of a great reproduce:

  • _zoro_ had a game running today where the bug was entirely reproducible. Often certain clients have "too good" latency to the host, so that they in particular can't rejoin. But this was the only time where I saw the host having this issue with every client.
  • 2 clients got disconnected right away after the game started and the loading screen disappeared:
    • Hannibal Barca (with 250ms ping to the host) and
    • fpre (with 25ms ping). They both could not rejoin the game anymore without many retries.
  • Many many late observers followed failing to finish the rejoin.
  • Seeing the opportunity, I rejoined that game then about 12 times (also got banned, restarted the router and retried undercover).
    • When I had 50ms ping to the host, the rejoin failed each time!
    • When I added 500ms lag using comment:5:ticket:3264, the rejoin succeeded each time!

Eureka, the worse the latency, the better the connection!

comment:8 by elexis, 4 years ago

Priority: Must HaveRelease Blocker

TODO: We actually must thread the affected map loading code too, since the GUI is not rendered for the same time (even if the client was threaded) and the window appears to be frozen for that period of the loading screen.

Flagging this as a release blocker, since clients have very often timed out when starting the game initially or rejoining. We can release without a fix to it if we can't make it, but we should really attempt to solve.

comment:9 by Andy Alt, 4 years ago

Cc: andy011973@… added

comment:10 by Andy Alt, 4 years ago

Owner: set to Andy Alt
Status: newassigned

by Andy Alt, 4 years ago

Attachment: 3700.txt added

only some renaming done. The threading hasn't been added yet. Builds but fails the tests.

comment:11 by Andy Alt, 4 years ago

A list of files where I also needed to change some references to the CNetClient class:

source/network/NetSession.cpp & .h source/network/NetClientTurnManager.cpp and .h source/ps/GameSetup/GameSetup.cpp source/gui/scripting/ScriptFunctions.cpp

In the first attachment, I only renamed the classes and functions; the threading routines haven't been added yet. It builds but fails the test suite.

by Andy Alt, 4 years ago

Attachment: 3700.2.txt added

does not build yet

comment:12 by Andy Alt, 4 years ago

I created the CNetClient Class and moved some functions into it. It doesn't build though.

Build output @ 3700.2.txt (above attachment)

I don't see much point in me trying to fix these errors yet, because I actually am very unsure which functions to put where. Perhaps I'm getting closer though. Anyone want to help me further with this or is it best left to the professionals?

Last edited 4 years ago by Andy Alt (previous) (diff)

comment:13 by Andy Alt, 4 years ago

Keywords: rfc added

by Andy Alt, 4 years ago

Attachment: 3700.patch added

use patch extension, Andy

comment:14 by Andy Alt, 4 years ago

Keywords: rfc removed
Summary: Running the NetClient in a separate thread[PATCH] Running the NetClient in a separate thread

by Andy Alt, 4 years ago

Attachment: 3700.2.patch added

A lot has been converted, resembles the NetServer files much more closely; requires much fine-tuning yet. I think we're on the right track.

by Andy Alt, 4 years ago

Attachment: 3700.3.patch added

minor changes from the previous version, for the purpose of clarity as elexis and I discuss it further on IRC

by Andy Alt, 4 years ago

Attachment: 3700.4.patch added

A few changes after discussion with elexis. WIP

by Andy Alt, 4 years ago

Attachment: 3700.5.patch added

builds but doesn't pass the tests. Thread-related functions will need a lot of work. when I copied them from NetServer.cpp, I either deleted or commented out most of the definitions. Some clean-up of the header file will probably be needed, some prototypes may be duplicated or in the wrong place

comment:15 by Andy Alt, 4 years ago

I removed the trace function; elexis said it wasn't used anymore, and it doesn't appear in NetServer.xxx so I decided to nuke it.

by Andy Alt, 4 years ago

Attachment: 3700.6.patch added

compiles, passes tests, still WIP

comment:17 by elexis, 3 years ago

Description: modified (diff)
Milestone: Alpha 22Alpha 23

Hi andy!

I have to bring the bad news that we probably don't have the manpower * time to fix this for this release :-/ We will try to avoid NetClient changes until then as far as possible, so that the current patch won't become broken.

comment:18 by elexis, 3 years ago

I didn't forget about this, I just have to reduce the number of pending patches that are around before I can get into this rabbithole again. People greet me with "BAN THE LAGGER" on every single late-observer join, so yeah, we need this xD

Phab:D492 may or may not be committed before I (or anyone else) get to this patch. A good number of lines of NetClient.cpp are touched, but changes are only syntax, no logic changes.

comment:19 by Andy Alt, 3 years ago

@elexis, I don't know if you still want the patch from the issue_3700 branch of my repo, but I rebased tonight (had to resolve some merge conflicts otherwise would have rebased sooner), so viewing the patch should be easier on the eyes now.

I'll unassign myself from this ticket now. Good luck!

comment:20 by Andy Alt, 3 years ago

Owner: Andy Alt removed
Status: assignednew

comment:21 by elexis, 3 years ago

Do want. Just that it will consume about 2 weeks of time estimatedly. Usually do that when I'm completely fed up with that issue and take it personally :P

comment:22 by elexis, 3 years ago

Milestone: Alpha 23Alpha 24
Priority: Release BlockerMust Have

We decided to release in about a month or two if possible. After I finished my rmgen/ cleanup I will try to give this a shot. It shouldn't be too hard (read: something like 2 weeks).

comment:23 by elexis, 2 years ago

Priority: Must HaveRelease Blocker

Some players even drop with 60 seconds of timeout tolerance and we definitely do not want players to have to wait 60 or more seconds after all other players besides the dropper have finished the loading screen.

It was the right choice to set this as a release blocker twice already, but it will cost probably 1-3 weeks to implement if lucky.

comment:24 by elexis, 2 years ago

In 21842:

Prevent players from disconnecting during the loading screen by increasing the timeout tolerance to 60 seconds for that period, fixes #5163.

The NetClient runs in the main thread, so any part of the loading screen consuming several seconds makes that client timeout.
This is a workaround because threading the NetClient would have prevent these timeouts, refs #3700.
Coutnerintuitively, since enet timeout tolerance is proportional to the latency, the better the connection of the player, the more likely it was to drop on gamestart.

This problem became very frequent in Alpha 23, at least due to the Aura bugfix rP21785, AIInterface being particularly slow and that not having been disabled yet in the loading screen resulting in additional 10 second freezes during the loading screen, even on empty maps, refs #5200, rP21838.

Differential Revision: https://code.wildfiregames.com/D1513
Based on patch by: causative

comment:25 by elexis, 2 years ago

Owner: set to elexis

Wrote a working patch from scratch today.

comment:26 by elexis, 2 years ago

Hadn't implemented the rejoin and turnmnager transitions by then. It adds a bunch of code and queues that are copied over each tick, but still the right thing to do IMO. Sometimes a garbage collection segfault on shutdown I didn't comprehend yet.

comment:27 by Andy Alt, 2 years ago

Patch: Phab:D1513

comment:28 by elexis, 2 years ago

Patch: Phab:D1513

Phab:D1513 was the committed patch to extend the timeout tolerance to at least 1 minute during the loading screen.

The patch to implement threading in the netclient is not distributed currently.

comment:29 by elexis, 16 months ago

Priority: Release BlockerMust Have

comment:30 by stanislas69, 3 months ago

Patch: Phab:D2848

Add currently broken patch link. help is welcome

Note: See TracTickets for help on using tickets.