Debugging

As debugging is a very complex topic, this page will compile tips specifically for debugging 0 A.D. problems. For info about profiling the game, see EngineProfiling.

Tools

The tools you use for debugging depend on your operating system of choice.

Windows

  • Visual Studio - the basic tool for debugging the game on Windows. Break into the debugger on a breakpoint, on a crash or assertion failure, or any other time. Visual C++ Express is free and contains similar debugging features. Can be used to analyze crash dumps and get a useful call stack.
  • WinDbg - part of the Windows SDK, a very powerful debugging suite which is primarily command line driven, unlike Visual Studio. Analyze crashes in more detail than VS.
  • DebugView - If you don't run the process in a debugger, DebugView lets you view its normally hidden debug output. Users can install and run this much more easily than a full debugging suite.
  • VMMap - Free tool from Microsoft to analyze the virtual memory usage of a process; shows fragmentation, can be useful for observing memory leaks or finding why a large allocation fails.
  • gDEBugger - Debug and profile OpenGL applications. Useful for debugging GL errors and finding unexpected behavior.
  • Notepad++ - small, simple, powerful text editor. You need a decent text editor on Windows.
  • A hex editor, like Bless - useful for examining binary simulation state dumps, either for saved games or serialization errors.

TODO: I know there are more...

Linux

  • gdb - the basic tool for debugging in the GNU toolchain. gdb lets you e.g. break into the engine on a breakpoint or when a seg fault or assertion failure occurs. You can start your process in gdb or attach gdb to a running process.
  • valgrind - debugging and profiling suite. Find memory leaks, invalid memory accesses and more.
  • gDEBugger
  • A hex editor

OS X

  • Xcode - free IDE for development on OS X; also includes a suite of debugging tools.
  • gdb

Debugging Crashes

You want two things when debugging crashes: 1) steps to reproduce, and 2) a call stack (i.e. back trace).

Reproducing the crash

Important info to gather:

  • Build environment - custom build, SVN autobuild, or release package? Which compiler version?
  • Hardware (e.g. system_info.txt)
  • Operating system (e.g. system_info.txt)
  • Which version of the game was the user playing?
  • What was the user doing when the crash occurred?
  • Were there any errors or visible problems before the crash? (e.g. interestinglog.html)
  • What are the minimal steps to get the crash? Is it consistent?

There are many different causes of crashes (running out of memory, heap corruption, invalid pointers or iterators, buggy drivers - to name only a few). Sometimes they are only reproducible with one OS, one type of GPU drivers, one version of a compiler, one type of CPU, or only after following a complex series of steps. Collecting the above information can point you and others in the right direction.

Sometimes, but not always, when the game crashes it will create a crashlog.txt in the log directory (see GameDataPaths). This file contains basic system info and often a call stack. On Windows, a crashlog.dmp may also be created, from which a call stack can be obtained.

Call stack on Windows

Reading crashlog.dmp:

  • It can be opened in Visual Studio or WinDbg. For this to be useful, you need the debug symbols and source code matching the affected build of the game. Most users run the autobuild version of the game, so you can simply download the correct autobuilt binaries from SVN; for a release, install that particular version of the game and acquire the source packages from http://releases.wildfiregames.com/. In the future we could automate this process, see #290.
  • You also need to set up symbol paths and a cache location for Microsoft symbol server, see Use the Microsoft Symbol Server to obtain debug symbol files.
  • In Visual Studio, after setting up your debug symbol paths, open the crashlog.dmp and choose to debug natively. You should get a crash of some kind, and can then break into the debugger. The call stack window will show you which functions were being called at that point. Note that in release builds, some data will be optimized out and not easily viewable.
  • In WinDbg, after setting up your debug symbol and source code paths, open the crash dump and use the ~*kp command to get a full call stack of each thread. See this extremely helpful article for more useful commands in WinDbg. For example, .frame 3 lets you set the current stack frame to #3 (the 3rd from the top of the call stack), then you can e.g. use the source code window to see exactly the line of code matching this function call, and locals window to see the variables in that function. Note that WinDbg can often open dump files that VS fails to open.
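
For example, a WinDbg session on a crash dump might look like the following minimal sketch; the symbol cache and source paths are illustrative and must point at locations matching your own setup:

  $$ point WinDbg at the Microsoft symbol server (with a local cache) and the matching source tree
  .sympath SRV*C:\symcache*https://msdl.microsoft.com/download/symbols
  .srcpath C:\0ad\source
  $$ reload symbols, then dump the call stack of every thread
  .reload
  ~*kp
  $$ switch to stack frame 3 and show its local variables
  .frame 3
  dv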

Without crashlog.dmp, the process is slightly more difficult:

  • If crashlog.txt was created and it contains a call stack, you may have enough information there to find the source of the crash.
  • However, if the crashlog or stack dump failed for some reason, you either need to reproduce the crash locally, or get the user to run the game in a debugger (e.g. WinDbg).
  • A third option on newer versions of Windows is to have the user create a memory dump from the Windows task manager. The user can find pyrogenesis.exe in the task manager, right-click it and choose Create Dump File. Beware that the resulting MEMORY.DMP will be very large, as it contains all memory pages in use by the process at the time, but it can be compressed with e.g. 7-Zip down to a more reasonable size.

Call stack on Linux / OS X

On Linux or OS X, you won't have a crash dump, so the best tool for getting a call stack is gdb. As on Windows, you need symbols to be set up properly to make sense of what gdb tells you. If you just see a bunch of hex numbers (addresses of functions) and no function names, then you don't have symbols set up correctly. If you're using a release package of the game on Linux, the symbols may have been omitted to reduce the package size, but there may be an optional debug package that can be installed (e.g. on Debian and Ubuntu). Note: debug symbols do NOT require a debug build of the game. A debug build just disables optimizations to make some debugging easier; a release build also includes debug symbols.

Once you have debug symbols, if you can reproduce the crash, run the game in gdb (e.g. gdb ./pyrogenesis) and make it crash. Then you will return to the gdb command line. The most helpful command is bt to get the backtrace (call stack) at the moment of the crash. Often it's even more helpful to use t a a bt full (short for thread apply all bt full) to get the full backtrace of all running threads in the game. set height 0 is useful if gdb keeps prompting you to continue when viewing a long backtrace.

gdb is quite powerful and has more features than can be reasonably explained here, so check the manual or search for tutorials. One thing you might find useful is selecting the current stack frame with frame n, e.g. frame 0 is the top of the stack. Then you can use info locals to view local variables in that stack frame. Note that in a release build, many of these will be optimized out, or the structure may be too complex for gdb to understand.
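
Putting this together, a minimal gdb session might look like the sketch below (assuming a pyrogenesis binary with debug symbols in the current directory; the frame number is just an example):

  gdb ./pyrogenesis
  (gdb) set height 0              # don't pause long output
  (gdb) run                       # play until the crash happens
  (gdb) bt                        # backtrace of the crashing thread
  (gdb) thread apply all bt full  # full backtrace of every thread ("t a a bt full")
  (gdb) frame 3                   # select stack frame #3
  (gdb) info locals               # show local variables in that frame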

Debugging Out of Sync and Serialization Errors

Out of sync

Out of sync (OOS) and serialization errors are generally difficult to debug, but knowing where to look can make this process simpler. An OOS error occurs in a multiplayer game when one player's serialized simulation state isn't identical to another player's serialized simulation state (breaking the concept of network synchronization). The following data are useful to collect in this case:

  • oos_dump.txt - a human readable snapshot of the simulation state at the point of OOS, created for each player in the game. Found in the logs folder, see GameDataPaths.
  • Each player's game version - these have to match; while the game is in the alpha phase, the simulation changes constantly and there is no backward compatibility. For releases, this means using the same alpha release; for SVN users, it means using the same SVN revision (with few exceptions).
  • OS and hardware info for each player (system_info.txt) - Some serialization bugs are platform specific, so knowing the systems involved is key to reproducing the error.
  • commands.txt - this file records the commands issued by each player during the game and can be used to replay the game exactly as it happened.

The easiest place to begin is doing a simple text diff of the oos_dump.txt files to see where they differ. On Windows, you can use the diff tool built into TortoiseSVN, TortoiseGit, or some other tool; on *nix you simply use 'diff'. Your diff tool may not like the binary data spit out by the CCmpAIManager component, so you can remove that part manually in a text editor. Note that in multiplayer games, a hash of the full simulation state is only computed every 20 turns for performance reasons, so it's possible that the states began diverging earlier!
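
For example, on *nix the comparison might look like this minimal sketch (the filenames are illustrative; in practice you collect each player's oos_dump.txt and rename them first):

  # unified diff of two players' simulation state dumps
  diff -u oos_dump_player1.txt oos_dump_player2.txt | less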

In the best case, you will see a small diff of changes comparing two or more of the dump files. This can point you to the component and property that differ, then by analyzing the code that writes to that property, you can see if it does anything unsafe, or (in the case of a C++ component) if it's not serialized correctly.

However, it may be that you won't see any diff, or maybe it will be huge and affect many entities and components. If there's no diff, that means the simulation state differed, but the difference doesn't affect the debug serializer. There are a few reasons why this could happen, but most likely the JSON representation of the dump doesn't capture the actual value, or (for a JavaScript value) the difference is in SpiderMonkey's internal representation of the data. This has been reported before with e.g. NaN having a single bit difference depending on JIT behavior (see #1879).

Known causes of OOS

The following are known, reproducible causes of OOS errors:

  • Rejoining a multiplayer game with AIs - because the AIs don't fully serialize their state, the rejoining player's state will differ and cause an OOS.
  • Multiplayer games with Aegis AI

Serialization test mode

The engine has a very useful test mode for debugging serialization errors. Using the -serializationtest command line option tells the simulation to do a full test of the simulation state every turn. This will be extremely slow compared to a normal run of the game, but can reveal problems the moment they occur. Note: it will currently always fail with AI players, since they don't serialize properly.
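
A minimal sketch of enabling it from a command prompt (the binary name and path depend on your platform and build):

  # run the game with the per-turn serialization test enabled (expect it to be very slow)
  ./pyrogenesis -serializationtest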

When the serialization test fails, an error window will be shown, but the more useful data is created in the game's log folder (see GameDataPaths) inside oos_log. In the following filenames, .a means data from the primary simulation state (e.g. the one used in a typical game), while .b is data from the secondary simulation state (e.g. the one being reconstructed every turn to compare with the primary state). When they differ, it's a serialization test failure.

  • debug.after.a / debug.after.b - A good place to begin, these are the debug output of the serializer after the current turn updates and can be compared with a diff tool (like oos_dump.txt for OOS errors).
  • state.after.a / state.after.b - Binary dump of the simulation state after the current turn update occurs. If an error occurs here, it is probably because some data that affects the simulation state isn't being serialized.
  • state.before.a / state.before.b - Binary dump of the simulation state before the current turn update occurs. If an error occurs here, it's probably a bug in the (de)serializer or the way it is (de)serializing the data.
  • hash.after.a / hash.after.b / hash.before.a / hash.before.b - hash values used to compare the above states.

If the debug dumps aren't helpful, the binary dumps can be viewed with a hex editor. Perhaps even more useful is a binary diff tool that can show exactly the offset of each difference. Because some ASCII data gets encoded in the binary simulation state, it is often possible to determine which component and property contained the difference.
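
As a minimal sketch on *nix, cmp can list the offset of every differing byte between the two binary dumps (run from inside the oos_log folder described above):

  # print the offset and values of each byte that differs between the primary and secondary state dumps
  cmp -l state.after.a state.after.b | head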

OOS logging mode

There is another useful option for debugging serialization errors, the -ooslog option. This will dump the simulation state in both binary and debug form every turn. The game will run much slower with this option, but for multiplayer testing, it can be useful to determine exactly when the states differed between two players, and how they differed.
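
A minimal sketch of enabling it (the same caveats about the binary name and path apply as above):

  # dump the simulation state in binary and debug form every turn
  ./pyrogenesis -ooslog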

Replay mode

Replays are created for every game session; see sim_log in the game's log directory (GameDataPaths). They are organized by process ID, so you might find it helps to sort them by modified date. commands.txt stores not only the commands sent by each player each turn, but also periodic hashes of the simulation state on a given turn. Replay mode is activated with the following command line option: -replay=/path/to/commands.txt. It's best to run the replay in a debugger (Windows) or command prompt (*nix) to view its output; no separate window is created.
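
For example, on *nix a replay might be launched like this (the path to commands.txt is illustrative):

  # replay a recorded game non-interactively; watch the command prompt output for hash mismatches
  ./pyrogenesis -replay=/path/to/commands.txt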

When replaying a commands.txt, the hashes are checked periodically. This can also be used to verify that no breaking changes are introduced by a new patch. Replay mode also creates a profile.txt with profiling information every 20 turns; this can be processed into a nice graph as explained here.

Other tips for serializer debugging

  • Debug annotations can prove very helpful when viewing the binary simulation state in a hex editor. The dump will be much larger but it will also contain more textual data. Set DEBUG_SERIALIZER_ANNOTATE to 1 in StdSerializer.h.
  • Check hashes more frequently. By default, the game balances e.g. OOS checks with the performance impact of serializing and hashing the state. You might find it helpful to change e.g. CReplayPlayer::Replay in Replay.cpp to check hashes more frequently in replay mode; you can also generate hashes more frequently by changing CNetTurnManager::TurnNeedsFullHash (multiplayer games) or CNetLocalTurnManager::NotifyFinishedUpdate (single player games) in NetTurnManager.cpp.

Debugging script errors

A generic JavaScript debugger wouldn't really work for the game's scripts, so instead we created our own debugger, see JavascriptDebugging for more details.

If the JS debugger doesn't help, you can always fall back to the basic debugging "by hand" technique of inserting logging functions into the suspected problematic code, and inspecting data to narrow down the cause.

Out of memory errors

Our JavaScript engine, SpiderMonkey, uses a simplistic garbage collector (GC) to manage script memory. Scripts run in contexts, with one or more contexts per runtime (data can be passed between contexts, but not trivially between runtimes - they might be in different threads). Each runtime is initialized with a fixed size, like 16MB or 128MB, and the GC runs periodically to free allocated memory. Because of this, it's entirely possible that a script runs out of available memory and begins failing. This is an out of memory (OOM) error, and it is also an uncatchable error.

TODO: document heap dump?

Debugging C++ / engine errors

Because C++ has to be compiled when changes are made, the debugging "by hand" technique is still useful but slightly more annoying than when debugging scripts. It's often more convenient to simply run the game in a debugger like Visual Studio or gdb, either a release or debug build, and set breakpoints to break into the debugger and get a sense of what is happening in the program. The process for this is similar to that described above for debugging crashes.
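
As a minimal sketch with gdb (the breakpoint location is only an illustration; substitute whatever engine function or file:line you are actually investigating):

  gdb ./pyrogenesis
  (gdb) break CComponentManager::PostMessage   # hypothetical breakpoint on an engine function
  (gdb) run
  (gdb) bt            # once the breakpoint hits, see how execution got here
  (gdb) info locals   # inspect variables in the current frame
  (gdb) continue      # resume the game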

TODO: links to C++ debugging tutorials/books?

Test suite

Always run the test suite. It gets built along with the game, as either test.exe on Windows or test on *nix. The test suite contains simple cases that might make it more obvious where a bug is located and why it occurs. Of course a failure in the test suite might also mean the test suite needs updating :)
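
A minimal sketch of running it on *nix from the directory containing the game binaries (use test.exe on Windows):

  # run the complete test suite and report any failures
  ./test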
