Directory listing of extended ASCII in Windows

EDIT: This question raised some interesting issues but I don't consider it to be answered. Based on feedback from this question I have asked a similar question with a much more specific task, http://www.mathworks.com/matlabcentral/answers/86186-working-with-unicode-paths.
ORIGINAL: Hi all,
I have a filename with 'é' in it. dir() does not handle it correctly and reports it as two separate characters, 'e´'. I'm using Win 7. Is there a setting I can change in Matlab or Windows to get this to work right? If I use Java, things seem to work fine:
my_java_dir = java.io.File(my_dir);
file_list = my_java_dir.listFiles();
I'd rather things "just work" instead of using Java.
Thoughts?
Thanks, Jim
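For anyone who wants to try the Java route from the question, here is a minimal sketch that collects the names into a MATLAB cell array. It assumes the JVM is enabled in MATLAB; pwd is only a stand-in for my_dir, and the variable names is introduced here for illustration.

```matlab
% Sketch: list a directory through Java and collect the names as MATLAB char arrays.
my_dir = pwd;                             % assumption: any existing directory
my_java_dir = java.io.File(my_dir);
file_list = my_java_dir.listFiles();      % java.io.File[]; empty if my_dir is not a directory

n = length(file_list);
names = cell(n, 1);
for k = 1:n
    % getName() returns a java.lang.String; char() converts it to a MATLAB char array
    names{k} = char(file_list(k).getName());
end
```

Each names{k} can then be inspected with double(names{k}) to see which code points Java reports.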
EDIT: This is a summary of some of the comments:
The code I am running is:
temp = dir(my_path);
file_name = temp(idx).name   % idx is a placeholder for the index of the affected entry
On Windows, for a file generated automatically by a proprietary program, the file name includes the character 'é'.
In Matlab however, file_name contains the following chars instead: 'e´'
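If the two characters are a base letter plus a combining accent (an assumption; the thread does not confirm the exact code points), they can be recomposed into a single 'é' with Unicode normalization form NFC. The sketch below goes through Java's java.text.Normalizer, so it does not avoid Java; the variable names are illustrative.

```matlab
% Assumption: the name holds 'e' (101) followed by U+0301 COMBINING ACUTE ACCENT (769).
decomposed = char([101 769]);

% Look up the NFC constant of the nested enum java.text.Normalizer.Form
% through javaMethod, using the '$' name of the nested class.
nfc = javaMethod('valueOf', 'java.text.Normalizer$Form', 'NFC');

% Recompose, then convert the resulting java.lang.String back to a MATLAB char array.
composed = char(java.text.Normalizer.normalize(java.lang.String(decomposed), nfc));

% Compare against the precomposed 'é' (U+00E9 = 233).
is_precomposed = isequal(double(composed), 233);
```

If is_precomposed comes back false for a real file name, the second character is probably something other than U+0301, for example the spacing accent U+00B4, which NFC does not merge with the preceding letter.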
From what I can tell, using native Matlab functionality, it is not possible to read a non 7-bit ascii file on a mac:
EDIT: I did not realize this was going to be as difficult to actually accomplish (i.e. to answer properly) as it has turned out to be. The details of some of the tests I have run have become a bit lost in the comments although at this point they are not relevant to a solution. At this point I don't consider the problem to be solved but I don't even have a test framework for trying to solve this problem! When I get a chance I'll be uploading an example file for people to test. Thanks.
10 comments
Walter Roberson 2013-8-26
After scanning a bit through the decomposition / recomposition document, my head hurts!
Jan 2013-8-27
Edited: Jan 2013-8-27
@Jim: This is an important question, and equivalent problems will occur in the work of many users. The seemingly humorous parts of my replies stem from frustration after struggling with Unicode for too long. But the problem is serious, and so is my suggestion to avoid non-ASCII characters.


Answers (2)

Walter Roberson 2013-8-24
What is the underlying file system type of the directory you are trying to work with? If it is not NTFS then you have a problem; see http://msdn.microsoft.com/en-us/library/windows/desktop/dd317748%28v=vs.85%29.aspx
3 comments
Walter Roberson 2013-8-25
Could you show the result of adding (numeric) 0 to the name string? (That will show the decimal value of each character in the string.) I'm thinking that possibly there is a "compose" or "dead byte" character, which is one of the ways of representing accented characters.
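As a small illustration of the +0 trick, the sketch below compares a precomposed 'é' with a decomposed one. The decomposed form used here (101 followed by 769, i.e. 'e' plus U+0301) is an assumption about what dir() might be returning, not something confirmed in this thread.

```matlab
precomposed = char(233);          % 'é' as a single character, U+00E9
decomposed  = char([101 769]);    % 'e' followed by U+0301 COMBINING ACUTE ACCENT (assumed)

codes_pre = precomposed + 0;      % numeric code of the single character
codes_dec = decomposed + 0;       % numeric codes of the two characters

% The two strings can render alike but are different character sequences.
same = isequal(precomposed, decomposed);
```

Here codes_pre is 233, codes_dec is [101 769], and same is false, which is why a plain string comparison against 'é' fails for the decomposed form.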
Jim Hokanson 2013-8-25
Dead bytes, yikes! I like +0, easier to type than double(str). I've added some clarifications in response to Jan's question, see above. Thanks.



Jan 2013-8-25
Edited: Jan 2013-8-25
This sounds really cruel. I have also struggled with UTF-16 and UTF-8 conversions for file access.
When I run this on my Win7/64 PC/local NTFS disk/Language = 'en_us.windows-1252' I get the expected correct results:
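str = ['t', 233, 'st.txt'];   % file name containing char(233) = 'é'
fid = fopen(str, 'w');        % create an empty file with that name
fclose(fid);
a = dir('t*.txt');            % other patterns do not change the answer
double(a.name)                % character codes of the name reported by dir()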
>> 116, 233, 115, 116, 46, 116, 120, 116
Windows Explorer also displays this name correctly. But the DOS command DIR fails, of course:
!dir t*st.txt
>> 25.08.13 23:20 8 tst.txt
It matters what exactly "yields on disk" means. How did you test this?
5 comments
Jan 2013-8-26
Edited: Jan 2013-8-26
I have some dull English keyboards in my storage place. There is even one with a missing [shift]-key and if somebody wants to appear cool, I can even remove the vowels from a Swedish keyboard.
I had severe trouble restoring a backup under Windows, because the paths exceeded the magic 260-character limit due to deeply nested folders with names like "Muskelzelle, 5 Proz. Kochsalzlösung, 2-60 Stunden Einwirkzeit, 60-fach vergrößert, Ethidium bromid, ausgewertet, ok". And here the trouble was not caused by the special characters.
30 years after MS-DOS there is still a limit of 260 characters on the path for many important Windows API functions, such as deleting to the Recycle Bin and showing the folder in Windows Explorer. This is so cruel and unprofessional that I cannot understand why users are discussing tiles and the missing Start button of Win8.0. Some API functions accept long file names when the ridiculous "\\?\" is prepended to the name, so MS has already recognized the need for this feature. But long names are far from working reliably.
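To make the "\\?\" remark concrete, here is a minimal sketch that prepends the prefix only when an absolute path reaches the 260-character limit. The path is made up for illustration, and whether a particular MATLAB or Windows function accepts the prefixed form is not verified here. The prefix only applies to fully qualified paths written with backslashes.

```matlab
% Hypothetical, deliberately long absolute path (3 + 40*7 + 8 = 291 characters).
abs_path = ['C:\' repmat('folder\', 1, 40) 'file.txt'];

max_path = 260;        % classic Windows MAX_PATH limit
prefix   = '\\?\';     % extended-length path prefix

% Add the prefix only when needed and only if it is not already present.
if numel(abs_path) >= max_path && ~strncmp(abs_path, prefix, numel(prefix))
    long_path = [prefix abs_path];
else
    long_path = abs_path;
end
```

Short paths pass through unchanged, so the same code can be applied to every path in a directory tree.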
So my impression is that the NTFS file system, with its UTF-16 strings and its support for long names, is mature and stable, but the Windows functions for accessing it are still in their infancy, and the problems are so childish that I'd call them "bugs".
Maybe MS purposely decided to impede speakers of French, Chinese and Tagalog in order to increase its profit in a strange and obscure way. And while the French and the Chinese have developed Linux (with help from some Finns), the Filipinos have written MacOS-X with the strange idea of using neither 2-byte nor 4-byte wchar's.
Using special characters in file names is obviously a bad idea, especially when different operating systems access the files. Do not let the childish OS ninjas involve you in their sandbox battle. 7-bit ASCII even looks good when written on durable pottery.
But seriously, Unicode is nice and the way to go in the future. Currently, however, it is supported reliably neither by the operating systems nor by Matlab. Problems like the destroyed accents will occur and are to be expected. Therefore it is still a good idea to keep file names short and simple, while the interesting details in French should be hidden inside the data of the file.
[EDITED] Sorry, it was not the Chinese who participated in the development of Linux; rather, the Japanese decided to remove the \r from the line breaks for obvious reasons.
Jim Hokanson 2013-8-27
Jan, I agree, don't use special characters in file names. I tend not to, but this particular example came from some file "in the wild." It would be nice to have a well-documented set of rules for what can and can't be done with respect to Unicode. For example, Matlab's use of a 16-bit (2-byte) character means it is impossible to accurately handle UTF-8 data streams, which are only well mapped to UTF-32 (4-byte character) data. Like many things, I think the first step is probably central documentation (i.e. by TMW) of usage modes and failure points.
Cédric, the problem actually comes from a Hungarian name, Georg Von Békésy, so it's the Hungarians that are giving me problems, not the French :)
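As a rough illustration of the character-width point, the sketch below converts a name between MATLAB's 16-bit chars and UTF-8 bytes using unicode2native and native2unicode. It only covers characters that fit in one 16-bit unit; code points above U+FFFF need two 16-bit units (a surrogate pair) in UTF-16 and are not exercised here. The variable names are illustrative.

```matlab
% 'Békésy' with precomposed 'é' (U+00E9 = 233); each character is one 16-bit unit.
name = ['B', char(233), 'k', char(233), 'sy'];

utf8_bytes = unicode2native(name, 'UTF-8');         % uint8 row vector of UTF-8 bytes
round_trip = native2unicode(utf8_bytes, 'UTF-8');   % decode back to MATLAB chars

n_chars = numel(name);         % number of 16-bit characters
n_bytes = numel(utf8_bytes);   % number of UTF-8 bytes ('é' takes two bytes)

lossless = isequal(round_trip, name);
```

For this name n_chars is 6 and n_bytes is 8, and the round trip reproduces the original characters.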
