Changes between Version 22 and Version 23 of Debugging


Ignore:
Timestamp:
Jun 17, 2021, 9:49:21 AM (3 years ago)
Author:
wraitii
Comment:

Update OOS doc

Legend:

Unmodified
Added
Removed
Modified
  • Debugging

    v22 v23  
    9090
    9191== Debugging Out of Sync and Serialization Errors ==
     92
     93=== Replay mode ===
     94Replays are created for every game session, see `replays` in the userdata directory (GameDataPaths). They are organized by date. `commands.txt` stores not only commands sent by each player each turn, but periodically the hash of the simulation state on a given turn. Replay mode is activated with the following command line option: `-replay=/path/to/commands.txt`. It's best to run the replay in a debugger (Windows) or command prompt (*nix) to view its output, no separate window is created.
     95
     96'''Important:''' Don't forget to specify which mod(s) to load with the command `mod=public` (or whatever mod you want to load) when running replays. Else the loader won't be able to find the mod-dependant data (like maps, pathfinding codes, etc).
     97
     98When replaying a `commands.txt`, the hashes are checked periodically. This can also be used to verify that no breaking changes are introduced by a new patch. Replay mode also creates a `profile.txt` with profiling information every 20 turns, this can be processed into a nice graph as explained [wiki:EngineProfiling#Creatinggraphs here].
     99
    92100=== Out of sync ===
    93 Out of sync (OOS) and serialization errors are generally difficult to debug, but knowing where to look can make this process simpler. An OOS error occurs in a multiplayer game when one player's serialized simulation state isn't identical to another player's serialized simulation state (breaking the concept of network synchronization). The following data are useful to collect in this case:
     101Out of sync (OOS) and serialization errors are generally difficult to debug, but knowing where to look can make this process simpler. An OOS error occurs in a multiplayer game when one player's serialized simulation state isn't identical to another player's serialized simulation state (breaking the concept of network synchronization). This is detected in-game by hashing the serialisation data. There are two types of hashes: quick-hashes that only hash positions and the "full" ones.
    94102
    95  * `oos_dump.txt` - a human readable snapshot of the simulation state at the point of OOS detection, created on each player's computer. Found in the logs folder, see GameDataPaths. Each player should zip these files and send them to the person troubleshooting the bug.
    96  * Each player's game version - these have to match, while the game is in alpha phase the simulation components change constantly and there is no backward compatibility. For releases, it means using the same alpha release, for SVN users, it means using the same SVN revision (with few exceptions).
     103The following data are useful to collect to debug OOS:
     104
     105 * `oos_dump.txt` - a human readable snapshot of the simulation state near the point of OOS detection, created on each player's computer. Found in the logs folder, see GameDataPaths. Each player should zip these files and send them to the person troubleshooting the bug. The most useful one to have is the OOS player. Note that OOS logs may not represent the exact turn of the OOS (because of network delays): that can be accounted for, see below.
     106 * `commands.txt` - this is the commands issued by each player during the game, which can be used to replay the game exactly as it happened. The commands should also contain the hashes: you should be able to replay the OOS or non-OOS track with a given revision.
     107
     108Other useful information:
     109 * Each player's game version - SVN revision, recompilation status, presence of mods. Note that players saying they are on the same revision can be reasonably doubted, as there's more than a few moving parts.
    97110 * OS and hardware info for each player (`system_info.txt`) - Some serialization bugs are platform specific, so knowing the systems involved is key to reproducing the error.
    98  * `commands.txt` - this is the commands issued by each player during the game, which can be used to replay the game exactly as it happened.
    99111
    100 The easiest place to begin is doing a simple text diff of the `oos_dump.txt` files to see where they differ. In Windows, you can use the diff tool built into `TortoiseSVN`, `TortoiseGit`, or some other tool, on *nix you simply use `diff`. Your diff tool may not like the binary data spit out by the CCmpAIManager component, so you can remove that by editing manually in a text editor. Note that in multiplayer games, a hash of the full simulation state only occurs every 20 turns for performance reasons, so it's possible that the states began diverging earlier!
     112==== Known causes of OOS ====
     113
     114OOS errors can be tricky, and can have a wild range of causes. Sometimes the cause is improbable or unexpected, so it pays to keep an open mind. It also pays to know what the most common causes are:
     115
     116 * Rejoining a multiplayer game with AIs - because the AIs don't fully serialize their state, the rejoining player's state will differ and cause an OOS, see #1089.
     117
     118Common causes:
     119 * For the past few alphas OOS have mostly been "OOS on rejoin". If that's the case, the most likely explanation is a bad serialisation (something is forgotten or incorrect). In those cases, one should check the OOS_dump for deltas (see below), and the most obvious candidates are the code that has been changed recently. The -rejointest (see below) can usually make debugging these easy.
     120 * A small part of all OOS issues are actually caused by Undefined Behaviour in C++. This is a long topic, but the most likely cause is that some variable was not initialised properly. Again, this should be detectable via the OOS dump, and the recent commits are the most suspicious.
     121
     122Less likely causes:
     123 * Actual non-determinism is possible, though we have not seen one for years. This could be caused by e.g. the computer architecture differing.
     124 * SpiderMonkey has some JIT compiler issues (see #2000 for example). If you want to rule out the JIT compiler, you can disable it by uncommenting the lines in ScriptInterface.cpp that set the options JSOPTION_JIT and JSOPTION_METHODJIT. If it is the JIT compiler, you should still try to find the exact location of the problem because disabling JIT compiling completely is bad for performance.
     125 * Data affecting the simulation is kept past the runtime of one game (see #2285 for example). In this case OOS errors typically occur if one or more players have previously played another game without shutting down the engine afterwards. It's difficult to troubleshoot with replays because they always start a fresh instance of the engine.
     126 * Due to SpiderMonkey internals, floating point operations can differ slightly on different machines and different architectures (see [https://bugzilla.mozilla.org/show_bug.cgi?id=531915 Bugzilla #531915]).
     127
     128==== Diffing the OOS dump ====
     129The easiest place to begin is doing a simple text diff of the `oos_dump.txt` files to see where they differ. In Windows, you can use the diff tool built into `TortoiseSVN`, `TortoiseGit`, or some other tool, on *nix you simply use `diff`. Your diff tool may not like the binary data spit out by the CCmpAIManager component, so you can remove that by editing manually in a text editor. Note that in multiplayer games, a hash of the full simulation state only occurs every 20 turns for performance reasons, so it's possible that the states began diverging earlier
    101130
    102131In the best case, you will see a small diff of changes comparing two or more of the dump files. This can point you to the component and property that differ, then by analyzing the code that writes to that property, you can see if it does anything unsafe, or (in the case of a C++ component) if it's not serialized correctly.
    103132
    104 However, it may be that you won't see any diff, or maybe it will be huge and affect many entities and components. If there's no diff, that means the simulation state differed, but the difference doesn't affect the debug serializer. There are a few reasons why this could happen, but most likely the [wiki:JSON] representation of the dump doesn't allow the actual value or (for a JavaScript value) the difference is in SpiderMonkey's internal representation of the data. This has been reported before with e.g. `NaN`, having a single bit difference depending on the JIT behavior (see #1879).
     133The diff will contain false positives: range visualisations & local entities (waypoints...) appear, but are not part of the actual serialized data. Those can be ignored.
    105134
    106 Even OOS dumps with many differences can be useful. Differences between local entities (previews, waypoint flags) can be disregarded, as they don't affect the serialized game state. If you notice one component is especially affected or a component of a recently added feature, then it's often possible (and desirable) to construct a much simpler test case. An OOS in an hour long multiplayer game is complicated, but an OOS or serialization failure on a blank map with a handful of entities is much simpler.
     135It is possible that the difference only happens in the 'binary' OOS dump. Usually, this happens because the [wiki:JSON] representation of the dump doesn't allow the actual value or (for a JavaScript value) the difference is in SpiderMonkey's internal representation of the data. This has been reported before with e.g. `NaN`, having a single bit difference depending on the JIT behavior (see #1879). Note that NaN is no longer serialized.
    107136
    108 On Linux you can safely start two instances of 0 A.D., host with one and rejoin with the other. This way you can test whether a certain behavior triggers an out-of-sync error.
    109 If you do that, you can use [attachment:t3339_command_line_option_ooslog_unique_v1.2.patch:ticket:3339 this patch] to prevent the oos_dump.txt from being overwritten, allowing you to compare the newly produced oos-dumps.
     137You can apply a patch to manually generate OOS dumps on a replay, and save/load to cut down time: https://code.wildfiregames.com/P237#2058
    110138
    111 Otherwise, you can use virtual machines in e.g. !VirtualBox to test mutiplayer games on a single machine. This is especially useful for testing cross-platform issues e.g. a Windows host and Linux or macOS client.
     139==== Rejointest & Serialization Test ====
     140
     141Two options can be used to reproduce OOS: the `-rejointest=[turn]` and the `-serializationtest` option.
     142
     143The former is much faster and often sufficient: it starts a second instance of the game at a given turn (as if a new player rejoined the game) and the runs it, comparing hashes. If you have an OOS on rejoin, a little exploration will usually trigger it.
     144
     145The `-serializationtest` command line option tells the simulation to do a full test of the simulation state ''every turn''. This will be extremely slow compared to a normal run of the game, but can reveal problems the moment they occur. '''Note:''' it will currently always fail with AI players, since they don't serialize properly.[[BR]]Note: sometimes, OOS only happen over several turns. This makes them unreproducible with the basic serializationtest.
     146
     147When the serialization test fails, an error window will be shown, but the more useful data is created in the game's log folder (see GameDataPaths) inside `oos_log`. In the following filenames, '''.a''' means data from the primary simulation state (e.g. the one used in a typical game), while '''.b''' is data from the secondary simulation state (e.g. the one being reconstructed every turn to compare with the primary state). When they differ, it's a serialization test failure.
     148
     149 * `debug.after.a` / `debug.after.b` - A good place to begin, these are the debug output of the serializer '''after''' the current turn updates and can be compared with a diff tool (like `oos_dump.txt` for OOS errors).
     150 * `state.after.a` / `state.after.b` - Binary dump of the simulation state '''after''' the current turn update occurs. If an error occurs here, it is probably because some data that affects the simulation state isn't being serialized.
     151 * `state.before.a` / `state.before.b` - Binary dump of the simulation state '''before''' the current turn update occurs. If an error occurs here, it's probably a bug in the (de)serializer or the way it is (de)serializing the data.
     152 * `hash.after.a` / `hash.after.b` / `hash.before.a` / `hash.before.b` - hash values used to compare the above states.
     153
     154If the debug dumps aren't helpful, the binary dumps can be viewed with a hex editor. Perhaps even more useful is a binary diff tool that can show exactly the offset of each difference. Because some ASCII data gets encoded in the binary simulation state, it is often possible to determine which component and property contained the difference.
     155
     156==== OOS logging mode ====
     157There is another useful option for debugging serialization errors, the `-ooslog` option. This will dump the simulation state in both binary and debug form every turn. The game will run much slower with this option, but for multiplayer testing, it can be useful to determine exactly when the states differed between two players, and how they differed.
    112158
    113159=== OOS caused by the AI ===
     
    125171Now you can add more debugging output in the part of the AI that is responsible for sending the differing commands and try to figure out why it's happening. If you add the same debugging output on multiple machines, you can use diff to easily find the place in the log where the first difference is found.
    126172
    127 === Known causes of OOS ===
    128 The following are known, reproducible causes of OOS errors that currently aren't solved:
    129 
    130  * Rejoining a multiplayer game with AIs - because the AIs don't fully serialize their state, the rejoining player's state will differ and cause an OOS, see #1089.
    131 
    132 The following are possible sources of OOS problems we found in the past:
    133 
    134  * SpiderMonkey has some JIT compiler issues (see #2000 for example). If you want to rule out the JIT compiler, you can disable it by uncommenting the lines in ScriptInterface.cpp that set the options JSOPTION_JIT and JSOPTION_METHODJIT. If it is the JIT compiler, you should still try to find the exact location of the problem because disabling JIT compiling completely is bad for performance.
    135  * Uninitialized variables can cause different behaviour because their value depends on random memory content. Valgrind helps detecting these errors (more detailed explanation needed).
    136  * Data affecting the simulation is kept past the runtime of one game (see #2285 for example). In this case OOS errors typically occur if one or more players have previously played another game without shutting down the engine afterwards. It's difficult to troubleshoot with replays because they always start a fresh instance of the engine.
    137  * Due to SpiderMonkey internals, floating point operations can differ slightly on different machines and different architectures (see [https://bugzilla.mozilla.org/show_bug.cgi?id=531915 Bugzilla #531915]).
    138 
    139 === Serialization test mode ===
    140 The engine has a very useful test mode for debugging serialization errors. Using the `-serializationtest` command line option tells the simulation to do a full test of the simulation state ''every turn''. This will be extremely slow compared to a normal run of the game, but can reveal problems the moment they occur. '''Note:''' it will currently always fail with AI players, since they don't serialize properly.[[BR]]Note: sometimes, OOS only happen over several turns. This makes them unreproducible with the basic serializationtest. In this situation, use the patch in #4242. It will start the second simulation at a given turn, and form then on will update it as if it were a second player.
    141 
    142 When the serialization test fails, an error window will be shown, but the more useful data is created in the game's log folder (see GameDataPaths) inside `oos_log`. In the following filenames, '''.a''' means data from the primary simulation state (e.g. the one used in a typical game), while '''.b''' is data from the secondary simulation state (e.g. the one being reconstructed every turn to compare with the primary state). When they differ, it's a serialization test failure.
    143 
    144  * `debug.after.a` / `debug.after.b` - A good place to begin, these are the debug output of the serializer '''after''' the current turn updates and can be compared with a diff tool (like `oos_dump.txt` for OOS errors).
    145  * `state.after.a` / `state.after.b` - Binary dump of the simulation state '''after''' the current turn update occurs. If an error occurs here, it is probably because some data that affects the simulation state isn't being serialized.
    146  * `state.before.a` / `state.before.b` - Binary dump of the simulation state '''before''' the current turn update occurs. If an error occurs here, it's probably a bug in the (de)serializer or the way it is (de)serializing the data.
    147  * `hash.after.a` / `hash.after.b` / `hash.before.a` / `hash.before.b` - hash values used to compare the above states.
    148 
    149 If the debug dumps aren't helpful, the binary dumps can be viewed with a hex editor. Perhaps even more useful is a binary diff tool that can show exactly the offset of each difference. Because some ASCII data gets encoded in the binary simulation state, it is often possible to determine which component and property contained the difference.
    150 
    151 === OOS logging mode ===
    152 There is another useful option for debugging serialization errors, the `-ooslog` option. This will dump the simulation state in both binary and debug form every turn. The game will run much slower with this option, but for multiplayer testing, it can be useful to determine exactly when the states differed between two players, and how they differed.
    153 
    154 === Replay mode ===
    155 Replays are created for every game session, see `replays` in the userdata directory (GameDataPaths). They are organized by date. `commands.txt` stores not only commands sent by each player each turn, but periodically the hash of the simulation state on a given turn. Replay mode is activated with the following command line option: `-replay=/path/to/commands.txt`. It's best to run the replay in a debugger (Windows) or command prompt (*nix) to view its output, no separate window is created.
    156 
    157 '''Important:''' Don't forget to specify which mod(s) to load with the command `mod=public` (or whatever mod you want to load) when running replays. Else the loader won't be able to find the mod-dependant data (like maps, pathfinding codes, etc).
    158 
    159 When replaying a `commands.txt`, the hashes are checked periodically. This can also be used to verify that no breaking changes are introduced by a new patch. Replay mode also creates a `profile.txt` with profiling information every 20 turns, this can be processed into a nice graph as explained [wiki:EngineProfiling#Creatinggraphs here].
    160 
    161 === Other tips for serializer debugging ===
     173==== Other tips for serializer debugging ====
    162174 * '''Debug annotations''' can prove very helpful when viewing the binary simulation state in a hex editor. The dump will be much larger but it will also contain more textual data. Set `DEBUG_SERIALIZER_ANNOTATE` to 1 in [source:/ps/source/simulation2/serialization/StdSerializer.h StdSerializer.h].
    163175 * '''Check hashes more frequently'''. By default, the game balances e.g. OOS checks with the performance impact of serializing and hashing the state. You might find it helpful to change e.g. `CReplayPlayer::Replay` in [source:/ps/source/ps/Replay.cpp Replay.cpp] to check hashes more frequently in replay mode, you can also generate hashes more frequently by changing `CNetTurnManager::TurnNeedsFullHash` (multiplayer games) or `CNetLocalTurnManager::NotifyFinishedUpdate` (single player games) in [source:/ps/source/network/NetTurnManager.cpp NetTurnManager.cpp].