95 | | * `oos_dump.txt` - a human readable snapshot of the simulation state at the point of OOS detection, created on each player's computer. Found in the logs folder, see GameDataPaths. Each player should zip these files and send them to the person troubleshooting the bug. |
96 | | * Each player's game version - these have to match, while the game is in alpha phase the simulation components change constantly and there is no backward compatibility. For releases, it means using the same alpha release, for SVN users, it means using the same SVN revision (with few exceptions). |
| 103 | The following data are useful to collect to debug OOS: |
| 104 | |
| 105 | * `oos_dump.txt` - a human readable snapshot of the simulation state near the point of OOS detection, created on each player's computer. Found in the logs folder, see GameDataPaths. Each player should zip these files and send them to the person troubleshooting the bug. The most useful one to have is the OOS player. Note that OOS logs may not represent the exact turn of the OOS (because of network delays): that can be accounted for, see below. |
| 106 | * `commands.txt` - this is the commands issued by each player during the game, which can be used to replay the game exactly as it happened. The commands should also contain the hashes: you should be able to replay the OOS or non-OOS track with a given revision. |
| 107 | |
| 108 | Other useful information: |
| 109 | * Each player's game version - SVN revision, recompilation status, presence of mods. Note that players saying they are on the same revision can be reasonably doubted, as there's more than a few moving parts. |
100 | | The easiest place to begin is doing a simple text diff of the `oos_dump.txt` files to see where they differ. In Windows, you can use the diff tool built into `TortoiseSVN`, `TortoiseGit`, or some other tool, on *nix you simply use `diff`. Your diff tool may not like the binary data spit out by the CCmpAIManager component, so you can remove that by editing manually in a text editor. Note that in multiplayer games, a hash of the full simulation state only occurs every 20 turns for performance reasons, so it's possible that the states began diverging earlier! |
| 112 | ==== Known causes of OOS ==== |
| 113 | |
| 114 | OOS errors can be tricky, and can have a wild range of causes. Sometimes the cause is improbable or unexpected, so it pays to keep an open mind. It also pays to know what the most common causes are: |
| 115 | |
| 116 | * Rejoining a multiplayer game with AIs - because the AIs don't fully serialize their state, the rejoining player's state will differ and cause an OOS, see #1089. |
| 117 | |
| 118 | Common causes: |
| 119 | * For the past few alphas OOS have mostly been "OOS on rejoin". If that's the case, the most likely explanation is a bad serialisation (something is forgotten or incorrect). In those cases, one should check the OOS_dump for deltas (see below), and the most obvious candidates are the code that has been changed recently. The -rejointest (see below) can usually make debugging these easy. |
| 120 | * A small part of all OOS issues are actually caused by Undefined Behaviour in C++. This is a long topic, but the most likely cause is that some variable was not initialised properly. Again, this should be detectable via the OOS dump, and the recent commits are the most suspicious. |
| 121 | |
| 122 | Less likely causes: |
| 123 | * Actual non-determinism is possible, though we have not seen one for years. This could be caused by e.g. the computer architecture differing. |
| 124 | * SpiderMonkey has some JIT compiler issues (see #2000 for example). If you want to rule out the JIT compiler, you can disable it by uncommenting the lines in ScriptInterface.cpp that set the options JSOPTION_JIT and JSOPTION_METHODJIT. If it is the JIT compiler, you should still try to find the exact location of the problem because disabling JIT compiling completely is bad for performance. |
| 125 | * Data affecting the simulation is kept past the runtime of one game (see #2285 for example). In this case OOS errors typically occur if one or more players have previously played another game without shutting down the engine afterwards. It's difficult to troubleshoot with replays because they always start a fresh instance of the engine. |
| 126 | * Due to SpiderMonkey internals, floating point operations can differ slightly on different machines and different architectures (see [https://bugzilla.mozilla.org/show_bug.cgi?id=531915 Bugzilla #531915]). |
| 127 | |
| 128 | ==== Diffing the OOS dump ==== |
| 129 | The easiest place to begin is doing a simple text diff of the `oos_dump.txt` files to see where they differ. In Windows, you can use the diff tool built into `TortoiseSVN`, `TortoiseGit`, or some other tool, on *nix you simply use `diff`. Your diff tool may not like the binary data spit out by the CCmpAIManager component, so you can remove that by editing manually in a text editor. Note that in multiplayer games, a hash of the full simulation state only occurs every 20 turns for performance reasons, so it's possible that the states began diverging earlier |
111 | | Otherwise, you can use virtual machines in e.g. !VirtualBox to test mutiplayer games on a single machine. This is especially useful for testing cross-platform issues e.g. a Windows host and Linux or macOS client. |
| 139 | ==== Rejointest & Serialization Test ==== |
| 140 | |
| 141 | Two options can be used to reproduce OOS: the `-rejointest=[turn]` and the `-serializationtest` option. |
| 142 | |
| 143 | The former is much faster and often sufficient: it starts a second instance of the game at a given turn (as if a new player rejoined the game) and the runs it, comparing hashes. If you have an OOS on rejoin, a little exploration will usually trigger it. |
| 144 | |
| 145 | The `-serializationtest` command line option tells the simulation to do a full test of the simulation state ''every turn''. This will be extremely slow compared to a normal run of the game, but can reveal problems the moment they occur. '''Note:''' it will currently always fail with AI players, since they don't serialize properly.[[BR]]Note: sometimes, OOS only happen over several turns. This makes them unreproducible with the basic serializationtest. |
| 146 | |
| 147 | When the serialization test fails, an error window will be shown, but the more useful data is created in the game's log folder (see GameDataPaths) inside `oos_log`. In the following filenames, '''.a''' means data from the primary simulation state (e.g. the one used in a typical game), while '''.b''' is data from the secondary simulation state (e.g. the one being reconstructed every turn to compare with the primary state). When they differ, it's a serialization test failure. |
| 148 | |
| 149 | * `debug.after.a` / `debug.after.b` - A good place to begin, these are the debug output of the serializer '''after''' the current turn updates and can be compared with a diff tool (like `oos_dump.txt` for OOS errors). |
| 150 | * `state.after.a` / `state.after.b` - Binary dump of the simulation state '''after''' the current turn update occurs. If an error occurs here, it is probably because some data that affects the simulation state isn't being serialized. |
| 151 | * `state.before.a` / `state.before.b` - Binary dump of the simulation state '''before''' the current turn update occurs. If an error occurs here, it's probably a bug in the (de)serializer or the way it is (de)serializing the data. |
| 152 | * `hash.after.a` / `hash.after.b` / `hash.before.a` / `hash.before.b` - hash values used to compare the above states. |
| 153 | |
| 154 | If the debug dumps aren't helpful, the binary dumps can be viewed with a hex editor. Perhaps even more useful is a binary diff tool that can show exactly the offset of each difference. Because some ASCII data gets encoded in the binary simulation state, it is often possible to determine which component and property contained the difference. |
| 155 | |
| 156 | ==== OOS logging mode ==== |
| 157 | There is another useful option for debugging serialization errors, the `-ooslog` option. This will dump the simulation state in both binary and debug form every turn. The game will run much slower with this option, but for multiplayer testing, it can be useful to determine exactly when the states differed between two players, and how they differed. |
127 | | === Known causes of OOS === |
128 | | The following are known, reproducible causes of OOS errors that currently aren't solved: |
129 | | |
130 | | * Rejoining a multiplayer game with AIs - because the AIs don't fully serialize their state, the rejoining player's state will differ and cause an OOS, see #1089. |
131 | | |
132 | | The following are possible sources of OOS problems we found in the past: |
133 | | |
134 | | * SpiderMonkey has some JIT compiler issues (see #2000 for example). If you want to rule out the JIT compiler, you can disable it by uncommenting the lines in ScriptInterface.cpp that set the options JSOPTION_JIT and JSOPTION_METHODJIT. If it is the JIT compiler, you should still try to find the exact location of the problem because disabling JIT compiling completely is bad for performance. |
135 | | * Uninitialized variables can cause different behaviour because their value depends on random memory content. Valgrind helps detecting these errors (more detailed explanation needed). |
136 | | * Data affecting the simulation is kept past the runtime of one game (see #2285 for example). In this case OOS errors typically occur if one or more players have previously played another game without shutting down the engine afterwards. It's difficult to troubleshoot with replays because they always start a fresh instance of the engine. |
137 | | * Due to SpiderMonkey internals, floating point operations can differ slightly on different machines and different architectures (see [https://bugzilla.mozilla.org/show_bug.cgi?id=531915 Bugzilla #531915]). |
138 | | |
139 | | === Serialization test mode === |
140 | | The engine has a very useful test mode for debugging serialization errors. Using the `-serializationtest` command line option tells the simulation to do a full test of the simulation state ''every turn''. This will be extremely slow compared to a normal run of the game, but can reveal problems the moment they occur. '''Note:''' it will currently always fail with AI players, since they don't serialize properly.[[BR]]Note: sometimes, OOS only happen over several turns. This makes them unreproducible with the basic serializationtest. In this situation, use the patch in #4242. It will start the second simulation at a given turn, and form then on will update it as if it were a second player. |
141 | | |
142 | | When the serialization test fails, an error window will be shown, but the more useful data is created in the game's log folder (see GameDataPaths) inside `oos_log`. In the following filenames, '''.a''' means data from the primary simulation state (e.g. the one used in a typical game), while '''.b''' is data from the secondary simulation state (e.g. the one being reconstructed every turn to compare with the primary state). When they differ, it's a serialization test failure. |
143 | | |
144 | | * `debug.after.a` / `debug.after.b` - A good place to begin, these are the debug output of the serializer '''after''' the current turn updates and can be compared with a diff tool (like `oos_dump.txt` for OOS errors). |
145 | | * `state.after.a` / `state.after.b` - Binary dump of the simulation state '''after''' the current turn update occurs. If an error occurs here, it is probably because some data that affects the simulation state isn't being serialized. |
146 | | * `state.before.a` / `state.before.b` - Binary dump of the simulation state '''before''' the current turn update occurs. If an error occurs here, it's probably a bug in the (de)serializer or the way it is (de)serializing the data. |
147 | | * `hash.after.a` / `hash.after.b` / `hash.before.a` / `hash.before.b` - hash values used to compare the above states. |
148 | | |
149 | | If the debug dumps aren't helpful, the binary dumps can be viewed with a hex editor. Perhaps even more useful is a binary diff tool that can show exactly the offset of each difference. Because some ASCII data gets encoded in the binary simulation state, it is often possible to determine which component and property contained the difference. |
150 | | |
151 | | === OOS logging mode === |
152 | | There is another useful option for debugging serialization errors, the `-ooslog` option. This will dump the simulation state in both binary and debug form every turn. The game will run much slower with this option, but for multiplayer testing, it can be useful to determine exactly when the states differed between two players, and how they differed. |
153 | | |
154 | | === Replay mode === |
155 | | Replays are created for every game session, see `replays` in the userdata directory (GameDataPaths). They are organized by date. `commands.txt` stores not only commands sent by each player each turn, but periodically the hash of the simulation state on a given turn. Replay mode is activated with the following command line option: `-replay=/path/to/commands.txt`. It's best to run the replay in a debugger (Windows) or command prompt (*nix) to view its output, no separate window is created. |
156 | | |
157 | | '''Important:''' Don't forget to specify which mod(s) to load with the command `mod=public` (or whatever mod you want to load) when running replays. Else the loader won't be able to find the mod-dependant data (like maps, pathfinding codes, etc). |
158 | | |
159 | | When replaying a `commands.txt`, the hashes are checked periodically. This can also be used to verify that no breaking changes are introduced by a new patch. Replay mode also creates a `profile.txt` with profiling information every 20 turns, this can be processed into a nice graph as explained [wiki:EngineProfiling#Creatinggraphs here]. |
160 | | |
161 | | === Other tips for serializer debugging === |
| 173 | ==== Other tips for serializer debugging ==== |