Opened 7 years ago

Closed 7 years ago

#4320 closed defect (fixed)

Replay Menu: Support non-ASCII characters

Reported by: elexis Owned by: elexis
Priority: Release Blocker Milestone: Alpha 22
Component: Core engine Keywords:
Cc: Patch: Phab:D518

Description (last modified by barho0om_55)

Some users, f.e. with czech characters in the replay path run into that case:

Couldn't open %s. Non-latin characters are not supported yet

Thus the replay menu is entirely empty to affected users.

It occurs in VisualReplay::LoadReplayData:

	const CStr fileName = replayFile.string8();
	std::ifstream* replayStream = new std::ifstream(fileName.c_str());

Previously utf8_from_wstring(fileName.string()) and OsString(filePath).c_str() have been used, see r17314 r17356. It might be a bug in any of the string conversion functions, or even FileExists

Any proposed solution should be tested with all operating systems.

Reported by Bouddica.

Change History (13)

comment:1 by leper, 7 years ago

Summary: Replay Menu: Support non-ASCI charactersReplay Menu: Support non-ASCII characters

Does loading savegames or config files work? If yes, then maybe using the same way of getting files might help.

comment:2 by elexis, 7 years ago

Savegame loading and config loading does work for Bouddica.

The issue is reproducible on linux by adding a replay subdirectory, named with non-ASCII letters, containing a non-empty commands.txt. Most, if not all distributions prevent adding users with exotic characters, so the bug should become only apparent on unix if the player rename their directories with those chars (hence the phrasing of error message).

We can notice that the GetFileInfo call in VisualReplay.cpp can resolve the given path correctly, but the std::ifstream using the string8() version doesn't. The Vfs callstack reveals that given the current state of our file library functions, OsString was indeed the correct way to resolve the path, not string8.

Call stacks: Config Saving:

  • CConfigDB::WriteFile from ConfigDB.cpp
  • CreateFile from vfs.cpp
  • Store from PRealDirectory from real_directory.cpp
  • Store from io.h
  • Open from file.h
  • FileOpen from file.cpp
  • wopen

Screenshots Saving:

  • WriteScreenshot from Util.cpp
  • tex_write from Util.cpp
  • CreateFile from vfs.cpp
    • see above
  • Then it uses GetRealPath from vfs.cpp to translate the Vfs path to an OsPath, applies string8() and shows the full path using LOGMESSAGERENDER and debug_printf

Loading vfs files ends up with the same callstack leading to wopen.

comment:3 by leper, 7 years ago

string8 was only meant for providing something that can be displayed, never as input to anything operating on the file system, since valid paths are just arbitrary bytes, not required to be actually valid utf8 (or any other encoding).

But if you notice the common code path both of these share you might notice that using the vfs might have made this a non-issue.

comment:4 by elexis, 7 years ago

From path.h:

	/**
	 * Return a UTF-8 version of the path, in a human-readable but potentially
	 * lossy form. It is *not* safe to take this string and construct a new
	 * Path object from it (it may fail for some non-ASCII paths) - it should
	 * only be used for displaying paths to users.
	 */
	std::string string8() const
	{
		// TODO: On Unixes, this will only be correct for ASCII or ISO-8859-1
		// encoded paths; we should probably assume UTF-8 encoding by default
		// (but take care to handle non-valid-UTF-8 paths safely).

		return utf8_from_wstring(path);
	}

Searching for string8 yields 146 results, of all except these occurances in the following 2 files are LOGERROR and debug_printf calls:

Exceptions:

  • VisualReplay.cpp: all ocurrances call std::ifstream
    • StartVisualReplay calling CGame::StartVisualReplay
    • LoadReplayData to load the commands.txt
    • GetReplayAttributes to load the first line of the commands.txt
    • GetReplayMetadata to load metadata.json
    • SaveReplayMetadata
  • ScriptInterface.cpp:
    • LoadScript
    • LoadGlobalScripts
    • LoadGlobalScriptFile
    • ReadJSONFile

Not sure why that string8 proxy is needed in the first place - utf8_from_wstring could just be used everywhere and the comment could go to that function. Though changing that will be a megapatch without any gain.

Searching for utf8_from_wstring yields further filepath calls with questionable UTF8 conversions.

  • archive_zip.cpp in utf8_from_wstring(pathname.string()) calls in seem suspicious.
  • TerrainTextureEntry.cpp has m_Tag = utf8_from_wstring(path.Basename().string());
  • GUItext.cpp uses above tag to check whether IconExists and to load m_Font
  • L10n.cpp extracts the locale code via utf8_from_wstring
  • JSI_Mod::GetAvailableMods uses utf8_from_wstring on the path
  • The TemplateLoader gets passed utf8_from_wstring(path) and uses wstring_from_utf8 to revert that before using the Vfs

So it seems this UTF8 error could occur in several places if we were to use non-ASCI characters in the public mod.


I didn't forget the Vfs topic. At least for the replay menu part we could do this (the replay writing still needs an open stream that is closed when ending the game).

But before, we have to evaluate whether there will be a performance impact. Loading some hundred replays can take minutes with an HDD, even just checking the filesize of all commands.txt took about 20s (when the files aren't cached by the HDD yet). So it seems we would have to mount the files at the begin of VisualReplay::LoadReplayData and unmount again before leaving the function. This is also needed as we want to refresh the replay list when reloading entering the GUI page. (Screenshots or savegames for example are also overwritten if the user changes the savegames or screenshots in the directory manually or with a second instance iirc). #3433 has a reviewable patch that reduces the loading time from 2 minutes to 2 seconds by caching everything into a replayCache.json file and skipping as much I/O as possible. Furthermore we have to check whether we lose another I/O performance advantage you had suggested back in the day (reading only the last line of the commands.txt to determine the replay duration).

I agree that consistent file loading would be an improvement. I have to see and test the diff before I can give a fully informed decision.

comment:5 by elexis, 7 years ago

Milestone: BacklogAlpha 22
Owner: set to elexis

in reply to:  4 comment:6 by leper, 7 years ago

Replying to elexis:

Not sure why that string8 proxy is needed in the first place - utf8_from_wstring could just be used everywhere and the comment could go to that function. Though changing that will be a megapatch without any gain. Searching for string8 yields 146 results, [...]

One would think that you just answered your own question, otherwise you're welcome to go back to read the commit that introduced that, and the surrounding discussion.

  • ScriptInterface.cpp:
    • LoadScript
    • LoadGlobalScripts
    • LoadGlobalScriptFile

All of those use it to get a string that will be reported when we get an error while executing that script, which is already loaded at that point and passed as another parameter.

  • ReadJSONFile

Didn't find any non-debug use there.

Searching for utf8_from_wstring yields further filepath calls with questionable UTF8 conversions.

  • archive_zip.cpp in utf8_from_wstring(pathname.string()) calls in seem suspicious.

Our ZIP code does not (at least from a quick glance) support UTF-8 file names, which was only added to the ZIP specification in 2007, also all .zip files we load should thus really only be ASCII, nothing against actually ensuring that. (Or extending it, or in case that is a lot of work using other code to do that work, though it is working fine so far.)

  • TerrainTextureEntry.cpp has m_Tag = utf8_from_wstring(path.Basename().string());

That is used to be written to map files, which means that there are likely some more limits on what can go there. Could be changed, but might be some work for no real benefit, as we control (or can define as valid) only files that are actually ASCII.

  • GUItext.cpp uses above tag to check whether IconExists and to load m_Font

This is again using only files in mods, so controlled by us.

  • L10n.cpp extracts the locale code via utf8_from_wstring

The locale code is ASCII, and mod folders/names should be too. Also we do control those files, so no huge issue here.

  • JSI_Mod::GetAvailableMods uses utf8_from_wstring on the path

We need to store that in the config, so a few things are not valid (\n, might need escaping, etc) and at some point users shouldn't need to create folders there (#4207). So we can just require mod folders (the displayed name can be arbitrary anyway) to be [-_A-Za-z0-9] or something like that.

However we should actually check that the mod folder can be stored in the config, so some work needs to be done there. If that check fails, we should display an error/warning message and hide the mod.

  • The TemplateLoader gets passed utf8_from_wstring(path) and uses wstring_from_utf8 to revert that before using the Vfs

Those paths are coming from the simulation, so we do control them.

So it seems this UTF8 error could occur in several places if we were to use non-ASCI characters in the public mod.

So the fix for most of these is to add a few sanity checks that tell people that do such stupid things to not do that.

Now why are most of these issues above different from the replay issue? Because we do not control the path (and in this case we have a full path, not a vfs path, which only gives us a subset of what we want and control) in the replay case, and must thus expect everything. (We should still strive for the same thing in other code, but most of those don't seem urgent.)

But before, we have to evaluate whether there will be a performance impact. [...]

Mount once and keep it mounted; use hotloading, change the handler that causes hotloading to occur to also handle additions; rely on the vfs not populating whole directory trees, unless someone actually needs them; notice that we can extend the vfs interface to do only partial reads if we really need it; actually benchmark it.

comment:7 by elexis, 7 years ago

Priority: Must HaveRelease Blocker

Doesn't have to stay a release blocker, but the replay menu is completely broken to the affected people.

comment:8 by elexis, 7 years ago

Owner: elexis removed

comment:9 by wraitii, 7 years ago

I'm running into a similar issue for the campaigns; the WriteJSONFile function will output a different file name (presumably because it's UTF16?) if you input strange characters (such as éèç) for the campaign name. I could work around this, but it's annoying.

However I'm not sure where it goes wrong? Does VFS support UTF8-filenames?

comment:10 by barho0om_55, 7 years ago

Description: modified (diff)

i'm having the same issue so a simple fast easy good solution is to change the replays and/or save path. i do not know how to so please tell me how i can change replays reading path. thank you

comment:11 by elexis, 7 years ago

Patch: Phab:D518

comment:12 by elexis, 7 years ago

Keywords: beta removed

In r19823:

Fix OsPath string8 function on Unix to account for characters that aren't covered by ISO-8859-1, thus allow proper printing of such paths.

Fixes #4647
Refs #4320 D518
Patch By: Philip
Tested By: Imarok on Windows, wraitii on OSX

comment:13 by elexis, 7 years ago

Owner: set to elexis
Resolution: fixed
Status: newclosed

In 19824:

Fix the replay menu for people with non-latin characters in their username.

Use OsPath instead of CStr and CStrW where possible,
wstring_from_utf8(OsPath.string8()) to pass printable strings to the JSAPI,
OsString when opening a filestream and
off_t instead of int for filesizes.

Fixes #4320
Differential Revision: https://code.wildfiregames.com/D518
Reviewed By: Imarok
Tested By: Imarok on Windows, wraitii on OSX
Special thanks to Philip for advice and the lib/path.h fix in rP19823.

Note: See TracTickets for help on using tickets.