#3700 closed defect (fixed)
[PATCH] Running the NetClient in a separate thread
Reported by: | elexis | Owned by: | elexis |
---|---|---|---|
Priority: | Must Have | Milestone: | Alpha 24 |
Component: | Network | Keywords: | beta |
Cc: | andy011973@… | Patch: | Phab:D2848 |
Description (last modified by )
Since the NetClient
is running in the same thread as the map generation, the loading screen can entirely block the client for a couple dozen of seconds, thus the client appears to be offline ("losing the connection") while joining or rejoining the game. This can cause the disconnect of the client while loading the game (some players can't rejoin anymore).
Some of the map generator calls are threaded, but it were more sustainable if the client is executed in a separate thread, avoiding repetition of this problem.
Original report (only one symptom of the underlying problem):
Since a chat-message is displayed when a client has finished rejoining (#1949), one can easily observe a reoccuring, distinct type of lag ("finished-rejoin lag").
It happens whenever a game has been played for a while (much serialized data).
The game appears to be frozen for 5 to 15 seconds.
Here is what happens:
The client is starting to rejoin.The server sends the client a lot of serialized data.The client needs a significant time to deserialize it.Meanwhile the game has continued (i.e. new turns)In order to finish the rejoin, the client simulates the new turns while all other clients wait for him. This is the finish-rejoin lag.
With #3242 rejoins happen more often, so the problem accumulates more in that situation.
In order to solve or reduce the impact of the issue, the playing clients could progress one turn for every five turns that the rejoined client computes. Hence the rejoined client still catches up with the other clients, while those are still able to play.
Attachments (8)
Change History (40)
comment:1 by , 8 years ago
comment:3 by , 7 years ago
Priority: | If Time Permits → Must Have |
---|
The problem is more severe then lag. Some players can't rejoin a game anymore and become disconnected due to extensive lag after the loading screen!
It's not necessarily the syncing after the map load, but also parts of the map loading process itself. It can be reproduced by starting a huge map in multiplayer mode and adding a LOGERROR("NetClientSession Poll %d", std::time(nullptr));
to CNetClientSession::Poll()
:
ERROR: NetClientSession Poll 1483494187 ERROR: NetClientSession Poll 1483494187 ERROR: NetClientSession Poll 1483494188 TIMER| ParseTerrain: 37.8426 ms ERROR: NetClientSession Poll 1483494188 ERROR: NetClientSession Poll 1483494188 TIMER| ParseEntities: 17.3228 s ERROR: NetClientSession Poll 1483494206 ERROR: NetClientSession Poll 1483494206 ERROR: NetClientSession Poll 1483494206 ERROR: NetClientSession Poll 1483494206 TIMER| common/modern/setup.xml: 140.615 us TIMER| common/modern/styles.xml: 148.302 us TIMER| common/modern/sprites.xml: 1.91044 ms TIMER| common/setup.xml: 261.822 us TIMER| common/setup_resources.xml: 77.32 us TIMER| common/sprites.xml: 523.206 us TIMER| common/styles.xml: 115.968 us TIMER| session/sprites.xml: 4.15153 ms TIMER| session/styles.xml: 195.216 us TIMER| session/session.xml: 111.865 ms TIMER| common/global.xml: 6.56328 ms GAME STARTED, ALL INIT COMPLETE ERROR: NetClientSession Poll 1483494220 ERROR: NetClientSession Poll 1483494222 TIMER| LoadDLL: 588.092 us ERROR: NetClientSession Poll 1483494231 ERROR: NetClientSession Poll 1483494231
As we can see the ParseEntities
part of the random map loading code already mutes the NetClient
for 17 seconds, which can be sufficient to disconnect it.
Two more undocumented tasks blocked the client for 14 seconds and 9 seconds respectively.
These tasks are apparently not being run in a separate thread like GenerateMap
(and the entire server) do.
comment:4 by , 7 years ago
Keywords: | beta added |
---|---|
Milestone: | Backlog → Alpha 22 |
comment:5 by , 7 years ago
TODO: We should likely thread the entire client instead of identifying the individual bottlenecks. Would be more sustainable as the design pattern avoids repetition of such bugs.
comment:6 by , 7 years ago
Description: | modified (diff) |
---|---|
Summary: | Finish-rejoin lag → Running the NetClient in a separate thread |
comment:7 by , 7 years ago
Tale of a great reproduce:
- _zoro_ had a game running today where the bug was entirely reproducible. Often certain clients have "too good" latency to the host, so that they in particular can't rejoin. But this was the only time where I saw the host having this issue with every client.
- 2 clients got disconnected right away after the game started and the loading screen disappeared:
- Hannibal Barca (with 250ms ping to the host) and
- fpre (with 25ms ping). They both could not rejoin the game anymore without many retries.
- Many many late observers followed failing to finish the rejoin.
- Seeing the opportunity, I rejoined that game then about 12 times (also got banned, restarted the router and retried undercover).
- When I had 50ms ping to the host, the rejoin failed each time!
- When I added 500ms lag using comment:5:ticket:3264, the rejoin succeeded each time!
Eureka, the worse the latency, the better the connection!
comment:8 by , 7 years ago
Priority: | Must Have → Release Blocker |
---|
TODO: We actually must thread the affected map loading code too, since the GUI is not rendered for the same time (even if the client was threaded) and the window appears to be frozen for that period of the loading screen.
Flagging this as a release blocker, since clients have very often timed out when starting the game initially or rejoining. We can release without a fix to it if we can't make it, but we should really attempt to solve.
comment:9 by , 7 years ago
Cc: | added |
---|
comment:10 by , 7 years ago
Owner: | set to |
---|---|
Status: | new → assigned |
by , 7 years ago
only some renaming done. The threading hasn't been added yet. Builds but fails the tests.
comment:11 by , 7 years ago
A list of files where I also needed to change some references to the CNetClient class:
source/network/NetSession.cpp & .h source/network/NetClientTurnManager.cpp and .h source/ps/GameSetup/GameSetup.cpp source/gui/scripting/ScriptFunctions.cpp
In the first attachment, I only renamed the classes and functions; the threading routines haven't been added yet. It builds but fails the test suite.
comment:12 by , 7 years ago
Build output @ 3700.2.txt (above attachment)
I don't see much point in me trying to fix these errors yet, because I actually am very unsure which functions to put where. Perhaps I'm getting closer though. Anyone want to help me further with this or is it best left to the professionals?
comment:13 by , 7 years ago
Keywords: | rfc added |
---|
comment:14 by , 7 years ago
Keywords: | rfc removed |
---|---|
Summary: | Running the NetClient in a separate thread → [PATCH] Running the NetClient in a separate thread |
by , 7 years ago
Attachment: | 3700.2.patch added |
---|
A lot has been converted, resembles the NetServer files much more closely; requires much fine-tuning yet. I think we're on the right track.
by , 7 years ago
Attachment: | 3700.3.patch added |
---|
minor changes from the previous version, for the purpose of clarity as elexis and I discuss it further on IRC
by , 7 years ago
Attachment: | 3700.5.patch added |
---|
builds but doesn't pass the tests. Thread-related functions will need a lot of work. when I copied them from NetServer.cpp, I either deleted or commented out most of the definitions. Some clean-up of the header file will probably be needed, some prototypes may be duplicated or in the wrong place
comment:15 by , 7 years ago
I removed the trace function; elexis said it wasn't used anymore, and it doesn't appear in NetServer.xxx so I decided to nuke it.
comment:17 by , 7 years ago
Description: | modified (diff) |
---|---|
Milestone: | Alpha 22 → Alpha 23 |
Hi andy!
I have to bring the bad news that we probably don't have the manpower * time to fix this for this release :-/ We will try to avoid NetClient changes until then as far as possible, so that the current patch won't become broken.
comment:18 by , 7 years ago
I didn't forget about this, I just have to reduce the number of pending patches that are around before I can get into this rabbithole again. People greet me with "BAN THE LAGGER" on every single late-observer join, so yeah, we need this xD
Phab:D492 may or may not be committed before I (or anyone else) get to this patch. A good number of lines of NetClient.cpp
are touched, but changes are only syntax, no logic changes.
comment:19 by , 7 years ago
@elexis, I don't know if you still want the patch from the issue_3700 branch of my repo, but I rebased tonight (had to resolve some merge conflicts otherwise would have rebased sooner), so viewing the patch should be easier on the eyes now.
I'll unassign myself from this ticket now. Good luck!
comment:20 by , 7 years ago
Owner: | removed |
---|---|
Status: | assigned → new |
comment:21 by , 7 years ago
Do want. Just that it will consume about 2 weeks of time estimatedly. Usually do that when I'm completely fed up with that issue and take it personally :P
comment:22 by , 6 years ago
Milestone: | Alpha 23 → Alpha 24 |
---|---|
Priority: | Release Blocker → Must Have |
We decided to release in about a month or two if possible. After I finished my rmgen/
cleanup I will try to give this a shot. It shouldn't be too hard (read: something like 2 weeks).
comment:23 by , 6 years ago
Priority: | Must Have → Release Blocker |
---|
Some players even drop with 60 seconds of timeout tolerance and we definitely do not want players to have to wait 60 or more seconds after all other players besides the dropper have finished the loading screen.
It was the right choice to set this as a release blocker twice already, but it will cost probably 1-3 weeks to implement if lucky.
comment:26 by , 6 years ago
Hadn't implemented the rejoin and turnmnager transitions by then. It adds a bunch of code and queues that are copied over each tick, but still the right thing to do IMO. Sometimes a garbage collection segfault on shutdown I didn't comprehend yet.
comment:27 by , 6 years ago
Patch: | → Phab:D1513 |
---|
comment:28 by , 6 years ago
Patch: | Phab:D1513 |
---|
Phab:D1513 was the committed patch to extend the timeout tolerance to at least 1 minute during the loading screen.
The patch to implement threading in the netclient is not distributed currently.
comment:29 by , 5 years ago
Priority: | Release Blocker → Must Have |
---|
The same issue is stated in
NetServer.cpp
of r10437 (in particular the part about clients not being happy):