Read Unicode Files
function C = textscanu(filename, encoding, del_sym, eol_sym, wb)
% TEXTSCANU Reads Unicode strings from a file and outputs a cell array of strings
%
% -------------
% INPUT
% -------------
% filename - string with the file's name and extension
% - example: 'textscanu.m.txt'
% encoding - encoding of the file
% - default: UTF-16LE
% - examples: UTF-16LE (little endian), UTF-8.
% - See http://www.iana.org/assignments/character-sets
% - MS Notepad saves in UTF-16LE ('Unicode'),
% UTF-16BE ('Unicode big endian'), UTF-8 and ANSI.
% del_sym - column delimiter symbol, given as an ASCII numeric code
% - default: 9 (tab)
% eol_sym - end-of-line delimiter symbol, given as an ASCII numeric code
% - default: 13 (carriage return) [Note: line feed = 10]
% - on MS Windows use 13, on Unix use 10
% wb - displays a waitbar if wb = 'waitbar'
%
% Defaults:
% -------------
% BOM - the first character of the file is assumed to be a
% Byte Order Mark and is removed if its unicode2native()
% value is 26
% byte_encoding - this value is read from the last two characters
% of the encoding input variable if they are 'LE' or 'BE';
% otherwise 'little endian' is the default on Windows and
% 'big endian' on Unix
% eol_len - number of characters used as end-of-line markers;
% on Windows with eol_sym = 13, eol_len is 2,
% otherwise 1
%
% -------------
% OUTPUT
% -------------
% C - cell array of strings
%
% -------------
% EXAMPLE
% -------------
% C = textscanu('textscanu.txt', 'UTF-8', 9, 13, 'waitbar');
% Reads the UTF-8 encoded file 'textscanu.txt', whose columns
% are delimited by tabs and whose lines are delimited by
% carriage returns. Shows a waitbar so that the progress
% of the read is visible.
%
% -------------
% NOTES
% -------------
% 1. Matlab's textscan function does not seem to handle
% multiscript Unicode files properly. Characters
% outside the ASCII range are given the value \u001a
% (ASCII 26), which usually renders on the
% screen as a box.
%
% Additional information at "Loren on the Art of Matlab":
% http://blogs.mathworks.com/loren/2006/09/20/
% working-with-low-level-file-io-and-encodings/#comment-26764
%
% 2. Text editors such as Microsoft Notepad or Notepad++ use
% a carriage return (CR, ASCII 13) followed by a line feed
% (LF, ASCII 10) to mark line ends (for example when you hit
% the Enter key), instead of a single character such as the
% line feed usual on Unix or the carriage return used by
% Microsoft Word.
%
% In textscanu, use ASCII 13 as the delimiter when line ends
% are marked with the CR/LF combination. Since the LF
% lies beyond the end of a given line and is not part of the
% next one, it is disregarded by the function.
%
% 3. If you get spaces between characters, try changing
% the encoding parameter (this typically happens when a
% UTF-16 file is read with a single-byte encoding).
%
% -------------
% BUG
% -------------
% When inspecting the output with the Array Editor,
% in the Workspace or through the Command Window,
% boxes might appear instead of Unicode characters.
% Type C{1,1} at the prompt, or in the Array Editor click
% on C and then on C{1,1}: you will see the correct string
% if a Unicode font covering the appropriate character
% ranges is installed and enabled for the Command
% Window and Array Editor (File > Preferences > Fonts).
%
% However, up to Matlab R2010a at least, Unicode
% characters display as boxes in figures, even if
% data is correctly stored in Matlab as Unicode.
%
% -------------
% REQUIREMENTS
% -------------
% Matlab version: starting with R2006b
%
% See also: textscan
%
% -------------
% REVISIONS LOG
% -------------
% 2015.01.06 - [fix] eol_len now set for any number of input arguments
% 2014.05.04 - [fix] attempt to close figure only if it exists
% 2011.01.17 - [new] support for Unix
% - [new] automatic detection of BOM presence
% 2010.12.31 - [new] the file is no longer required to end
% without end-of-line marks
% - [fix] define default waitbar handle value
% and make the message more informative
% 2010.10.04 - [fix] upgrade to Matlab version 2007a
% 2009.06.13 - [new] added option to display a waitbar
% 2008.02.27 - function creation
%
% -------------
% CREDITS
% -------------
% Vlad Atanasiu
% atanasiu@alum.mit.edu, http://www.waqwaq.info/atanasiu/
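
The reading steps described in the help text (open the file with an explicit encoding, drop a leading byte order mark, split on the end-of-line and column delimiters, ignore the LF of a CR/LF pair) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the helper name `read_delimited_sketch`, the use of `regexp(..., 'split')`, and the direct comparison against U+FEFF (rather than the unicode2native()-based test mentioned above) are choices made for this example only.

```matlab
function C = read_delimited_sketch(filename, encoding, del_sym, eol_sym)
% READ_DELIMITED_SKETCH  Hypothetical helper for illustration; not part of textscanu.
%   filename - path to the text file
%   encoding - e.g. 'UTF-8' or 'UTF-16LE'
%   del_sym  - column delimiter as a numeric code, e.g. 9 (tab)
%   eol_sym  - end-of-line delimiter as a numeric code, e.g. 13 or 10

fid = fopen(filename, 'r', 'n', encoding);   % open with an explicit encoding
if fid == -1
    error('Could not open file: %s', filename);
end
txt = fread(fid, '*char')';                  % decode the whole file into a char row vector
fclose(fid);

% Drop a leading byte order mark (U+FEFF) if the decoder kept it.
if ~isempty(txt) && double(txt(1)) == 65279
    txt(1) = [];
end

% Split into lines at eol_sym. With CR/LF line ends and eol_sym = 13,
% every line after the first starts with a stray LF, which is removed here.
lines = regexp(txt, regexptranslate('escape', char(eol_sym)), 'split');
if eol_sym == 13
    lines = regexprep(lines, '^\n', '');
end
if ~isempty(lines) && isempty(lines{end})
    lines(end) = [];                         % ignore a trailing end-of-line mark
end

% Split each line into columns at del_sym. Rows may differ in length,
% so missing cells are left as empty strings.
del_pat = regexptranslate('escape', char(del_sym));
parts = cellfun(@(s) regexp(s, del_pat, 'split'), lines, 'UniformOutput', false);
ncol = max([1, cellfun(@numel, parts)]);
C = repmat({''}, numel(parts), ncol);
for k = 1:numel(parts)
    C(k, 1:numel(parts{k})) = parts{k};
end
end
```

Saved as `read_delimited_sketch.m`, it could be called in the same way as the EXAMPLE above, for instance `C = read_delimited_sketch('textscanu.txt', 'UTF-8', 9, 13);`, after which `C{1,1}` holds the first field of the first line.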
Cite As
Vlad Atanasiu (2024). Read Unicode Files (https://www.mathworks.com/matlabcentral/fileexchange/18956-read-unicode-files), MATLAB Central File Exchange.
Platform Compatibility
Windows macOS Linux
Acknowledgements
Inspired by: Information-based Similarity Toolbox
Inspired: Information-based Similarity Toolbox, WH-1080 weather station data viewer
Version | Published | Release Notes |
---|---|---|
1.7 | | [fix] eol_len now set for any number of input arguments |
1.6.0.0 | | Published as Matlab toolbox (.mltbx file). |
1.5.0.0 | | [fix] attempt to close figure only if figure exists |
1.4.0.0 | | [new] no requirement anymore not to end the file with end of line marks |
1.1.0.0 | | Added option to display a waitbar showing the progress of data reading. |
1.0.0.0 | | Adding sample file, clarifying format of input file. |