Opened 8 years ago

Last modified 6 years ago

#3941 new defect

Game freezes when lobby is down

Reported by: Imarok Owned by: sbirmi
Priority: Must Have Milestone: Backlog
Component: Multiplayer lobby Keywords:
Cc: Patch:

Description (last modified by elexis)

When the lobby is not available and you try to join the lobby the game freezes. It seems like it gets stuck in a loop trying to connect to the lobby. observed on:

  • Arch linux (echotangoecho)
  • Ubuntu (me)

There should be just a warning that the lobby is not available at the moment.

Change History (9)

comment:1 by elexis, 8 years ago

Component: Core engineMultiplayer lobby

Easily reproducible (at least here on linux) by simulating enormous lag when attempting to connect (see comment:5:ticket:3264).

comment:2 by sbirmi, 7 years ago

Owner: set to sbirmi

comment:3 by sbirmi, 7 years ago

Small update. Added delay as elexis mentioned in ticket3264. I see the following code path.

prelobby.xml triggers lobbyStartConnect().

 lobbyStartConnect():
 -> Engine.ConnectXmppClient():
    -> (XmppClient)g_XmppClient->connect():
       -> (glooxwrapper::Client)m_client->connect(block=false):
            
+           debug_printf("glooxwrapper::Client::connect block=%d\n", block);
            (gloox::Client)m_Wrapped->connect(block)  <-- we are getting stuck here
+           debug_printf("glooxwrapper::Client::connect returned\n");

(For completeness) The header file for gloox::Client (libraries/win32/gloox/include/gloox/clientbase.h) says the following:

      /**
       * Initiates the connection to a server. This function blocks as long as a connection is
       * established.
       * You can have the connection block 'til the end of the connection, or you can have it return
       * immediately. If you choose the latter, its your responsibility to call @ref recv() every now
       * and then to actually receive data from the socket and to feed the parser.
       * @param block @b True for blocking, @b false for non-blocking connect. Defaults to @b true.
       * @return @b False if prerequisits are not met (server not set) or if the connection was refused,
       * @b true otherwise.
       * @note Since 0.9 @link ConnectionListener::onDisconnect() onDisconnect() @endlink is called
       * in addition to a return value of @b false.
       */
      bool connect( bool block = true );

comment:4 by sbirmi, 7 years ago

Hmm.. my initial hunch is that this method of testing (adding delay through the linux kernel) maybe the wrong way to test. This would definitely delay DNS resolution for one. Even if gloox wants to return control early, it may block until DNS resolution completes. Need to look into gloox's code for that.

comment:5 by elexis, 7 years ago

DNS query results are cached afaik

comment:6 by sbirmi, 7 years ago

I ran tcpdump (tcpdump -n -i wlan0 port 53 -vv -bb) to see what happens.

(without delay)

20:40:10.769693 IP (tos 0x0, ttl 64, id 51803, offset 0, flags [DF], proto UDP (17), length 87)
    192.168.1.6.40270 > 54.186.51.153.53: [udp sum ok] 38598+ SRV? _xmpp-client._tcp.lobby.wildfiregames.com. (59)
20:40:15.774832 IP (tos 0x0, ttl 64, id 57376, offset 0, flags [DF], proto UDP (17), length 87)
    192.168.1.6.58165 > 8.8.8.8.53: [udp sum ok] 38598+ SRV? _xmpp-client._tcp.lobby.wildfiregames.com. (59)
20:40:15.835430 IP (tos 0x0, ttl 59, id 10437, offset 0, flags [none], proto UDP (17), length 150)
    8.8.8.8.53 > 192.168.1.6.58165: [udp sum ok] 38598 NXDomain q: SRV? _xmpp-client._tcp.lobby.wildfiregames.com. 0/1/0 ns: wildfiregames.com. SOA ns1.middelkoop.cc. webmaster.middelkoop.cc. 2016112801 10800 3600 1209600 43200 (122)
20:40:15.835587 IP (tos 0x0, ttl 64, id 52648, offset 0, flags [DF], proto UDP (17), length 69)
    192.168.1.6.59518 > 54.186.51.153.53: [udp sum ok] 21087+ A? lobby.wildfiregames.com. (41)
20:40:20.838581 IP (tos 0x0, ttl 64, id 57762, offset 0, flags [DF], proto UDP (17), length 69)
    192.168.1.6.60424 > 8.8.8.8.53: [udp sum ok] 21087+ A? lobby.wildfiregames.com. (41)
20:40:21.028248 IP (tos 0x0, ttl 59, id 55757, offset 0, flags [none], proto UDP (17), length 85)
    8.8.8.8.53 > 192.168.1.6.60424: [udp sum ok] 21087 q: A? lobby.wildfiregames.com. 1/0/0 lobby.wildfiregames.com. A 136.243.15.233 (57)

In a different terminal, I add 3000ms delay (qdisc delay). Without exiting 0ad, I exited the lobby and tried connecting again (this is the same tcpdump session running). I see multiple dns packets going out still.

20:40:36.353887 IP (tos 0x0, ttl 64, id 56630, offset 0, flags [DF], proto UDP (17), length 87)
    192.168.1.6.55245 > 54.186.51.153.53: [udp sum ok] 24635+ SRV? _xmpp-client._tcp.lobby.wildfiregames.com. (59)
20:40:41.356245 IP (tos 0x0, ttl 64, id 58508, offset 0, flags [DF], proto UDP (17), length 87)
    192.168.1.6.43371 > 8.8.8.8.53: [udp sum ok] 24635+ SRV? _xmpp-client._tcp.lobby.wildfiregames.com. (59)
20:40:41.416895 IP (tos 0x0, ttl 59, id 7634, offset 0, flags [none], proto UDP (17), length 150)
    8.8.8.8.53 > 192.168.1.6.43371: [udp sum ok] 24635 NXDomain q: SRV? _xmpp-client._tcp.lobby.wildfiregames.com. 0/1/0 ns: wildfiregames.com. SOA ns1.middelkoop.cc. webmaster.middelkoop.cc. 2016112801 10800 3600 1209600 43200 (122)
20:40:44.417082 IP (tos 0x0, ttl 64, id 57305, offset 0, flags [DF], proto UDP (17), length 69)
    192.168.1.6.40101 > 54.186.51.153.53: [udp sum ok] 59617+ A? lobby.wildfiregames.com. (41)  <<<< *we returned to 0ad code around here* the unresponsive session was till here.
20:40:49.422245 IP (tos 0x0, ttl 64, id 59016, offset 0, flags [DF], proto UDP (17), length 69)
    192.168.1.6.46767 > 8.8.8.8.53: [udp sum ok] 59617+ A? lobby.wildfiregames.com. (41)
20:40:49.480770 IP (tos 0x0, ttl 59, id 12054, offset 0, flags [none], proto UDP (17), length 85)
    8.8.8.8.53 > 192.168.1.6.46767: [udp sum ok] 59617 q: A? lobby.wildfiregames.com. 1/0/0 lobby.wildfiregames.com. A 136.243.15.233 (57)

The unresponsive behavior happens while we are in gloox code. We stalled in the second attempt here as well. That being said, there is definitely one bug here which is reproduced the following way:

In lobbyStartConnect(), as as soon as Engine.ConnectXmppClient() returns, we render "Connecting..." for example. Now, control is resumed to the gui. I tried hitting "Cancel". The connect screen returns to page "pageWelcome". However,

  1. feedback.caption is not reset (so you see "Connecting..." appearing in the backend).
  2. also, looks like we probably (don't reset g_LobbyIsConnecting or something similar). We actually end up connecting to the lobby (the UI switches to the next page) still.

comment:7 by sbirmi, 7 years ago

I don't see us caching dns responses (on linux). For example, this (2010 article) https://www.linux.com/learn/weekend-project-cache-local-dns-queries seems to imply it doesn't. I don't have bind running FYI.

Quick googling shows systemd-resolved (which from the name seems relevant) is not running on my system. Neither is my system running pdnsd, dnsmasq, nor nscd.

comment:8 by sbirmi, 7 years ago

Filed #4438 for the issue described in comment 6 above.

comment:9 by elexis, 6 years ago

Description: modified (diff)

Threading.

Note: See TracTickets for help on using tickets.