Opened 11 years ago

Closed 11 years ago

Last modified 11 years ago

#1729 closed defect (fixed)

Timer-related problem in Atlas (timer apparently not monotonic as spec'd)

Reported by: B. Guns Owned by: ben
Priority: Should Have Milestone: Alpha 12
Component: Atlas editor Keywords: timer
Cc: Jan Wassenberg Patch:

Description

This happened to me in the latest revision but I experienced it as well about a week ago.

Attachments (2)

crashlog.txt (19.0 KB ) - added by B. Guns 11 years ago.
system_info.txt (12.5 KB ) - added by B. Guns 11 years ago.

Download all attachments as: .zip

Change History (29)

by B. Guns, 11 years ago

Attachment: crashlog.txt added

by B. Guns, 11 years ago

Attachment: system_info.txt added

comment:1 by Erik Johansson, 11 years ago

Milestone: BacklogAlpha 12
Priority: Should HaveRelease Blocker

in reply to:  description comment:2 by B. Guns, 11 years ago

Replying to plumo:

This happened to me in the latest revision but I experienced it as well about a week ago.

After I click 'surpress' twice I'm able to use the Atlas Editor though. So I'm not sure that it is a proper CTD-crash...

comment:3 by historic_bruno, 11 years ago

Are you running Atlas through the main menu by chance?

When I do that, I don't get a visible error message, but I do get an assertion failure sometimes:

GameLoop.cpp(149): Assertion failed: "realFrameLength >= 0.0f"

Atlas fails to load, but the error window is never shown due to some other error. I wonder if the crashlog you have there is only the secondary error, not the original one.

Can you try the following:

  1. Download and run Microsoft's free DebugView utility.
  2. Run Atlas as you were trying to before. Hopefully you still get the error.
  3. Copy and paste the contents of DebugView here in your reply.
Last edited 11 years ago by historic_bruno (previous) (diff)

in reply to:  3 comment:4 by historic_bruno, 11 years ago

Replying to historic_bruno:

When I do that, I don't get a visible error message, but I do get an assertion sometimes:

GameLoop.cpp(149): Assertion failed: "realFrameLength >= 0.0f"

About this error, I have a possible fix that involves replacing Atlas' frame time calculation with the frequency filter technique used in main.cpp. I don't know yet if that's the same error you have, hopefully we will find out soon :)

comment:5 by B. Guns, 11 years ago

  1. Yes I'm using Atlas via main menu.
  2. I get two errors: the one reported above and also a second one. Sadly I only found the last error in the log files.
  3. After that I can use Atlas without any problem (so it seems at least).

PS: debugview doesn't give me a report by just being opened. Do I click somewhere first?

Last edited 11 years ago by B. Guns (previous) (diff)

in reply to:  5 comment:6 by historic_bruno, 11 years ago

Replying to plumo:

PS: debugview doesn't give me a report by just being opened. Do I click somewhere first?

Maybe it needs to be run as Administrator?

comment:7 by historic_bruno, 11 years ago

plumo, did you have a chance to look into this?

comment:8 by ben, 11 years ago

In 12902:

Attempted fix to prevent strange timer-related bug in Atlas (fails to start when opened from the game's main menu), refs #1729

comment:9 by B. Guns, 11 years ago

I just compiled the latest build and the problem seems to have disappeared. No more notices :) .

comment:10 by historic_bruno, 11 years ago

Cc: Jan Wassenberg added
Keywords: timer added
Milestone: Alpha 12Backlog
Priority: Release BlockerShould Have
Summary: Error starting up AtlasTimer-related problem in Atlas (timer apparently not monotonic as spec'd)

Glad to hear it! The underlying issue with timers hasn't been fixed though, r12902 is just covering it up, so I'll keep this ticket open and lower the priority (and backlog it).

For reference, Philip and I discussed this in detail on 11-30-2012.

comment:11 by Jan Wassenberg, 11 years ago

Thanks for the heads-up! I am still maintaining this code and happy to have a look.

The original error in the crashlog indicates a failure to register the window class. This might be due to something mentioned in MSDN: "No window classes registered by a DLL are unregistered when the DLL is unloaded. A DLL must explicitly unregister its classes when it is unloaded."

Additionally, someone comments: "it is better to simply ignore the return value from this function and pick up the failure when CreateWindow() is called as that will fail due to 'class not registered'". That seems like a perfectly adequate fix. We have a unique class name (thanks to inclusion of a GUID), so there is no realistic chance of a collision. The error may well be caused by Windows not automatically unregistering the class. I am not aware of any reason not to just reuse the class if already registered. There is a little bit of time to spare this morning, so I will just remove that check. In fact, everything seems to run fine without it, because you can successfully suppress it anyway.

@Plumo: running DebugView will not generate any report - rather it provides a place for the game/Atlas to write debug messages. Those will be displayed as they are generated - if DebugView is running at the particular moment they arise. That means we have to start it before the game is launched. Unfortunately DebugView has some issue on Win7 x64 and crashes occasionally. Very annoying, because I am not aware of a replacement for it.

As to the timer, congratulations on having a CPU and timer fast enough to trigger this issue :) I am sorry I was too short-sighted to realize that float is a pretty useless representation of high-resolution time. The CPU timestamp counter increments at the CPU or at least uncore clock rate and can therefore be less than FLT_EPSILON. I would recommend going on a search+destroy for places where the return value of timer_Time is cast to float, and replacing with double. There should only be a few places such as the ENSURE (which can then be re-instated) and the framerate computation. Ben, would you be inclined to do that?

in reply to:  11 comment:12 by historic_bruno, 11 years ago

Replying to jan:

As to the timer, congratulations on having a CPU and timer fast enough to trigger this issue :) I am sorry I was too short-sighted to realize that float is a pretty useless representation of high-resolution time. The CPU timestamp counter increments at the CPU or at least uncore clock rate and can therefore be less than FLT_EPSILON. I would recommend going on a search+destroy for places where the return value of timer_Time is cast to float, and replacing with double. There should only be a few places such as the ENSURE (which can then be re-instated) and the framerate computation. Ben, would you be inclined to do that?

Hi Jan, thanks for the quick response. I've tried using double instead of float but for some reason the assertion still fails:

double time = timer_Time();
static double last_time = time;
double realFrameLength = time-last_time;
last_time = time;
ENSURE(realFrameLength >= 0.0); 

When I break into the debugger, it says time=realFrameLength=-6.0211987751301876e-006, which looks odd. Could it be some threading or compiler issue? I notice the sound manager thread is frequently accessing timer_Time().

If I break into the debugger after this error, the threads look like this:

Unflagged		5960	0	Worker Thread	CSoundManagerWorker	EnsureMonotonic	Normal
Unflagged		6800	0	Worker Thread	engine_thread	sys_display_error	Normal
Unflagged		6448	0	Main Thread	main	sem_wait	Normal
Unflagged		5680	0	RPC Thread	RPC Callback Thread	7735fd71	Normal
Unflagged		5332	0	Worker Thread	whrt_UpdateThread	UpdateThread	Normal
Unflagged		7088	0	Worker Thread	Win32 Thread	77361f26	Normal
Unflagged		4664	0	Worker Thread	Win32 Thread	77361f26	Normal
Unflagged		3356	0	Worker Thread	Win32 Thread	7735f8b1	Normal
Unflagged		1252	0	Worker Thread	Win32 Thread	7736013d	Normal
Unflagged		3956	0	Worker Thread	Win32 Thread	77361f26	Normal
Unflagged		3924	0	Worker Thread	Win32 Thread	74b978d7	Normal
Unflagged	>	6784	0	Worker Thread	Win32 Thread	7735f8b1	Time Critical
Unflagged		4928	0	Worker Thread	Win32 Thread	77361f26	Normal
Unflagged		6828	0	Worker Thread	Win32 Thread	77361f26	Normal
Last edited 11 years ago by historic_bruno (previous) (diff)

comment:13 by Jan Wassenberg, 11 years ago

In 12908:

fix synchronization of TimerState (refs #1729)

comment:14 by Jan Wassenberg, 11 years ago

OK, that would be the first call with last_time = 0. Threading troubles are a real possibility here. I've looked at EnsureMonotonic, which seems ok, and improved the sync of TimerState (though that should not have caused trouble). If this happens consistently, though, that's probably not threading-related. Would you please let me know the first few lines of debug output, i.e. which HRT is being used?

comment:15 by historic_bruno, 11 years ago

The error doesn't occur with every run, I'd say over 50% of the time though. Oh and it doesn't occur or occurs much less frequently with a debug build.

HRT: activating HPET failed: Unknown error (-100022, 0xFFFFFFFFFFFE794A)
HRT: using name=TSC freq=3392000000.000000
HRT: counter=TSC freq=3.392e+009 res=2.94811e-010 bits=64
Last edited 11 years ago by historic_bruno (previous) (diff)

comment:16 by Jan Wassenberg, 11 years ago

It is expected that TSC would be being used. Bad news that it is sporadic, though. Could be a race condition. Debug builds can also have an influence - they flush FPU regs to float and therefore truncate the values more than optimized release builds. The HRT module does not use floats, though.

To narrow this down, would you please ENSURE() the argument to EnsureMonotonic is non-negative? Might be good to printf the values before and after, as well as GetCurrentThreadId.

comment:17 by historic_bruno, 11 years ago

The new ENSURE is passing, and it appears that the value t is identical before and after calling EnsureMonotonic. However, putting the printfs in there really slows it down and the original error doesn't occur.

Does this look correct though? (mostly that a result returned later in time has an earlier timer value)

before t=0.38157664681603776000, thread=24
before t=0.38185578449292457000, thread=504
after t=0.38185578449292457000, thread=504
after t=0.38157664681603776000, thread=24

comment:18 by historic_bruno, 11 years ago

Cc: stwf added

Also, Jan what do you think about it not displaying the error window? The assertion failure causes Atlas to simply freeze so that I have to break into it with a debugger. Here's what the engine thread is doing at that point:

 	user32.dll!_NtUserCallHwndParamLock@12()  + 0x15 bytes	
 	user32.dll!_NtUserCallHwndParamLock@12()  + 0x15 bytes	
>	pyrogenesis.exe!sys_display_error(const wchar_t * text=0x03060000, unsigned int flags=6)  Line 315 + 0x13 bytes	C++
 	pyrogenesis.exe!debug_DisplayError(const wchar_t * description=0x09729998, unsigned int flags=4, void * context=0x0335ec80, const wchar_t * lastFuncToSkip=0x006aa620, const wchar_t * pathname=0x0132e010, int line=149, const char * func=0x0132e080, volatile int * suppress=0x013803a4)  Line 469 + 0x18 bytes	C++
 	pyrogenesis.exe!debug_OnAssertionFailure(const wchar_t * expr=0x0132dfe0, volatile int * suppress=0x013803a4, const wchar_t * file=0x0132e010, int line=149, const char * func=0x0132e080)  Line 558 + 0xa7 bytes	C++
 	pyrogenesis.exe!RunEngine(void * data=0x003cf6e4)  Line 149 + 0x29 bytes	C++
 	pyrogenesis.exe!thread_start(void * param=0x00486798)  Line 624 + 0x3 bytes	C++
 	msvcr100.dll!__endthreadex()  + 0x3a bytes	
 	msvcr100.dll!__endthreadex()  + 0xe4 bytes	
 	kernel32.dll!@BaseThreadInitThunk@12()  + 0x12 bytes	
 	ntdll.dll!___RtlUserThreadStart@8()  + 0x27 bytes	
 	ntdll.dll!__RtlUserThreadStart@8()  + 0x1b bytes	

The responsible call seems to be here:

INT_PTR ret = DialogBoxParamW(hInstance, lpTemplateName, hWndParent, dlg_OnMessage, (LPARAM)&params);

comment:19 by Jan Wassenberg, 11 years ago

mm, the available time dropped again :/ *sigh* You are correct that DialogBoxParamW is a modal dialog, which totally disables the app window for the duration of its message loop. This is by design and usually a good thing (don't want the app to continue) - but probably bad for Atlas. It is easily possible to override this with some wxW stuff for Atlas - have a look at ah_display_error. That might be a good thing to do :)

The trouble with printf and debug_printf is that they take locks. Besides the speed hit, that will also serialize threads and change execution. It really sounds like this is a race condition. I've been developing on 64-bit CPUs for the past few years, and this code seems to assume 64-bit reads are atomic. That does not hold for our 32-bit app. Please try the simple and straightforward solution of using a pthread_mutex (with PTHREAD_MUTEX_INITIALIZER) in that function to serialize updates to the "max time", which should then be a double.

Last edited 11 years ago by Jan Wassenberg (previous) (diff)

comment:20 by historic_bruno, 11 years ago

Thanks Jan! Your mentioning ah_display_error pointed me to comments in the source code and some discussions you had with Philip back in 2006 about these very issues :) I guess one thing that has changed since then is AtlasUI and engine threads have been swapped and that requires a more thorough shutdown of the (old) engine when starting Atlas from in-game.

Sorry, I don't quite understand how to place the mutex locks in EnsureMonotonic. You think it is cpu_CAS64 that is causing the problem, or this line:

 const i64 oldRepresentation = maxRepresentation;

Should I leave it as i64 or change to double? The workings of this code is a bit over my head :)

comment:21 by Jan Wassenberg, 11 years ago

I'm glad comments were helpful - is important to document as much as possible. cpu_CAS64 should be safe - if that is broken, we are all kinds of toast. i64 oldRepresentation is indeed the culprit. Let's change to double and write something like this:

lock (ideally via scoped lock holder, I forget whether we have one of those)
static double maxTime;
maxTime = std::max(maxTime, time);
return maxTime (and unlock)

comment:22 by historic_bruno, 11 years ago

Cc: stwf removed

OK, here's what I've got, it seems to work:

static pthread_mutex_t ensure_monotonic_mutex = PTHREAD_MUTEX_INITIALIZER;
// NB: does not guarantee strict monotonicity - callers must avoid
// dividing by the difference of two equal times.
static void EnsureMonotonic(double& newTime)
{
    pthread_mutex_lock(&ensure_monotonic_mutex);
    static double maxTime;
    maxTime = std::max(maxTime, newTime);
    newTime = maxTime;
    pthread_mutex_unlock(&ensure_monotonic_mutex);
}

comment:23 by Jan Wassenberg, 11 years ago

Looks good, thanks! That should fix it *thumbsup*

comment:24 by ben, 11 years ago

Owner: set to ben
Resolution: fixed
Status: newclosed

In 12927:

Fixes race condition in EnsureMonotonic on 32-bit builds, which was causing unreliable timer behavior, fixes #1729

comment:25 by historic_bruno, 11 years ago

Milestone: BacklogAlpha 12

comment:26 by historic_bruno, 11 years ago

Thanks for the help sorting that out, Jan :)

comment:27 by ben, 11 years ago

In 13724:

Fixes stack overflow in stack dump logic, by special-casing boost::unordered types, fixes #1813.
Adds INIT_HAVE_DISPLAY_ERROR init flag to correctly override ah_display_error in Atlas (InitVfs was overriding it again), but don't use it because it's ugly and broken, refs #1729

Note: See TracTickets for help on using tickets.