How to access unicode strings through MEX/Engine C interfaces?
Short version
How can I access the underlying unicode data of MATLAB strings through the MATLAB Engine or MEX interface?
Here's an example. Let's put unicode characters in a UTF-8 encoded file test.txt, then read it as
fid=fopen('test.txt','r','l','UTF-8');
s=fscanf(fid, '%s')
If I first do feature('DefaultCharacterSet', 'UTF-8'), then engEvalString(ep, "s"), I get back the text from the file as UTF-8. This shows that MATLAB stores it as Unicode internally. However, if I do mxArrayToString(engGetVariable(ep, "s")), I get what unicode2native(s, 'Latin-1') would give me in MATLAB: all non-Latin-1 characters are replaced by code 26. What I need is access to the underlying Unicode data as a C string in any Unicode format (UTF-8, UTF-16, etc.). Is this possible?
- "[mxArrayToString] supports multibyte encoded characters."
So how can I get the multibyte non-Latin-1 characters then?
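To make the symptom concrete, here is a minimal sketch in plain C of a narrowing conversion that would produce exactly the output described above. It is only an illustration of the observed behaviour, not MATLAB's actual implementation; the function name narrow_to_latin1 is hypothetical, and the input is assumed to be a buffer of 16-bit code units.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the MATLAB API.
 * Narrows 16-bit code units to ISO-8859-1 (Latin-1). Any unit above
 * U+00FF cannot be represented and is replaced by the ASCII SUB
 * control character (code 26). `out` must hold n + 1 bytes. */
void narrow_to_latin1(const uint16_t *in, size_t n, char *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        out[i] = (in[i] <= 0xFF) ? (char)in[i] : (char)0x1A;
    }
    out[n] = '\0';
}
```

With this sketch, an á (U+00E1) survives as the single Latin-1 byte 0xE1, while an ő (U+0151) comes out as code 26, which matches what I see.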
----
Original long version
What character encoding does MATLAB use internally, if any, and is there a way to control this? To be precise, I would like to know if there is a way to guarantee that any character array I retrieve is going to adhere to a particular encoding, preferably a Unicode one.
I am interfacing MATLAB with another library through the MATLAB Engine interface and I need to guarantee a character encoding when sending strings to the other library. Is this possible at all, or are MATLAB's strings plain char arrays with no associated encoding?
Related things I found:
- This here says that it uses UTF-16, but that's not what I see when I retrieve strings in C code.
- I found references to feature('DefaultCharacterEncoding', 'UTF-8') on the web. What this appears to do is control what encoding the input commands (engEvalString) are assumed to have, and how the output is encoded. If I supply a UTF-8 encoded á as s='á', then retrieve this in C, I get an ISO-Latin-1 encoded á. If I send something that's not in Latin-1, I get nonsense (actually character code 26). (At least this is my impression after a few simple tests; these are time-consuming.)
In light of this finding, I'd like to know: does MATLAB support Unicode for all its strings? If yes, how do I get access to these from the C interface? (Any Unicode encoding is acceptable: UTF-8, UTF-16, UTF-32, etc.) If it doesn't support Unicode, is ISO-Latin-1 its default? Can I assume that all strings I retrieve through the C interface can be interpreted as ISO-Latin-1?
Also, any pointers to the relevant documentation on the issue are most welcome.
(I should probably mention that I was testing this on OS X as I'm aware that there are differences in the implementation of the matlab engine interface between platforms.)
Accepted Answer
Jan
2013-2-18
Sorry, this does not match your question exactly, but perhaps it is useful for the topic.
I believe mxChar was originally intended to be UTF-16, however the surrogate pair style unicode characters do not appear to be fully supported. However I suspect passing these characters through MATLAB 'mxChar' to the operating system should still be fine as MATLAB links against ICU (International Components for Unicode).
For compilers that have 'wchar_t' as a 16-bit value and use encoding schemes UTF-16 / UCS-2, this code will be safe.
For 32-bit 'wchar_t' values, you would need to convert from UTF-16 to the encoding scheme employed by the operating system. To convert basic MATLAB strings to UTF-32, you could potentially just leave the upper 16 bits at zero. However, as you might expect, certain strings obtained from the operating system may be in surrogate pair form, which requires a slightly more advanced conversion. It may be better to use a separate library such as ICU to do the conversion between UTF-16 and the Linux encoding scheme.
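For readers who do not want to pull in ICU, the surrogate pair handling mentioned above can be sketched in a few lines of plain C. This is a minimal sketch under stated assumptions, not a replacement for a tested library: the function name utf16_to_utf32 is hypothetical, and the input is assumed to be a buffer of 16-bit code units such as the one mxGetChars returns (with its length taken from mxGetNumberOfElements).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the MATLAB API.
 * Decodes UTF-16 code units into UTF-32 code points and returns the
 * number of code points written. `out` must have room for n entries,
 * since the output is never longer than the input. Unpaired
 * surrogates are replaced by U+FFFD (REPLACEMENT CHARACTER). */
size_t utf16_to_utf32(const uint16_t *in, size_t n, uint32_t *out)
{
    assert(in != NULL && out != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        uint32_t u = in[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < n &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            /* High surrogate followed by a low surrogate: combine. */
            uint32_t lo = in[++i];
            u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            /* Lone surrogate: not a valid code point on its own. */
            u = 0xFFFD;
        }
        out[count++] = u;
    }
    return count;
}
```

For strings without surrogates, this reduces to the "leave the upper 16 bits at zero" case: each code unit is simply widened.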
More Answers (1)
Walter Roberson
2013-2-18
MATLAB uses a 16 bit character internally, but it does not use UTF-anything. It simply uses the first 65536 Unicode code points.
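If each 16-bit value really is a plain code point, producing UTF-8 on the C side needs at most three bytes per character. The following is a minimal sketch under that assumption; the function name bmp_to_utf8 is hypothetical, and the input buffer is assumed to come from something like mxGetChars. It does not combine surrogate pairs, so it is only correct for the first 65536 code points described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the MATLAB API.
 * Encodes 16-bit code points as UTF-8, treating every input value as
 * a code point in its own right. `out` must hold 3 * n + 1 bytes.
 * Returns the number of bytes written, excluding the terminating NUL. */
size_t bmp_to_utf8(const uint16_t *in, size_t n, unsigned char *out)
{
    assert(in != NULL && out != NULL);
    size_t k = 0;
    for (size_t i = 0; i < n; ++i) {
        uint16_t c = in[i];
        if (c < 0x80) {
            /* 1 byte: 0xxxxxxx */
            out[k++] = (unsigned char)c;
        } else if (c < 0x800) {
            /* 2 bytes: 110xxxxx 10xxxxxx */
            out[k++] = (unsigned char)(0xC0 | (c >> 6));
            out[k++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
            out[k++] = (unsigned char)(0xE0 | (c >> 12));
            out[k++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[k++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    out[k] = '\0';
    return k;
}
```

Working from the 16-bit data directly avoids the Latin-1 narrowing that the question describes for mxArrayToString.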