word count matrix problem

Can anyone see how I can correct this code for the word count matrix? I am counting the unique words across all the files, and I have a 2469 (unique words) x 160 (reviews) matrix. I have attached a snippet of the matrix for preview.
The problem is that I am completely stuck on how to allocate the word counts to each of the reviews. Instead, the total count appears in the first column and the rest are zero. I would very much appreciate it if someone could look at my code and see if they can find the problem. It is probably a really stupid error, but I just cannot see it. I have tried loads of methods to solve it, and this appears to be the best one so far (for me at least).
clear all;
% Collects requested files from a specified folder and inserts them into an array
fpath = ('C:\Users\Willem\Documents\MATLAB\fold1');
% Returns an error if folder is not found
if ~isdir(fpath)
    errorMessage = sprintf('Error: The following folder does not exist:\n%s', fpath);
    uiwait(warndlg(errorMessage));
    return;
end
files = dir(fullfile(fpath,'*.oneline'));
nfiles = length(files);
data = {};
docArray = {};
if true
    data = [];
    % Separates each file's data strings into individual columns within the matrix
    for k = 1:nfiles
        thisdata = importdata(fullfile(fpath,files(k).name)); % imports the data into the matrix array
        nrow = length(thisdata); % extend number of rows if needed
        docArray(1:nrow,end+1) = thisdata(:); % displays each review per column
        data = [data; importdata(fullfile(fpath,files(k).name))]; % creates single column array of all the words
    end
end
uniqueWords = unique(data); % Checks for unique words in all the review strings
% Counts the number of times each unique word appears in each review
wordCount = zeros(numel(uniqueWords),k);
for j = 1:length(uniqueWords)
    counter = 0;
    for l = 1:length(data)
        if isequal(uniqueWords{j},data{l})
            counter = counter +1;
        end
    end
    wordCount(j) = counter;
end

7 comments

Why are you using importdata() twice on the same file? The data is already stored in thisdata.
data = [data; thisdata];
Why did you leave out the code for extending the matrix?
if size(docArray,1) < nrow
    docArray(nrow,1) = {}; %extend number of rows if needed
end
The thisdata thing was just me playing around and forgetting to change things back; sorry for the stupidity. The code
if true
    if size(docArray,1) < nrow
        docArray(nrow,1) = {}; %extend number of rows if needed
    end
end
is inserted, but I left it out because the rows extend without it and it returns an error:
Subscripted assignment dimension mismatch.
Error in Untitled (line 22) docArray(nrow,1) = {}; %extend number of rows if needed
Make it
docArray{nrow,1} = '';
I changed the brackets to quotes as suggested, but it simply shifted the 160 columns to the left and created an empty first column.
I have, however, found a way of counting the first file into column 1, but I am still unable to find a way of looping it over each of the other files so that the counts go into individual columns up to k=160. Code below (with data swapped for docArray):
if true
    % code
    uniqueWords = unique(data); % Checks for unique words in all the review strings
    % Counts the number of times each unique word appears in each review
    wordCount = zeros(numel(uniqueWords),k);
    thisreview = docArray(:,k);
    for j = 1:length(uniqueWords)
        counter = 0;
        for l = 1:length(docArray)
            if isequal(uniqueWords{j}, docArray{l})
                counter = counter +1;
            end
        end
        wordCount(j) = counter;
    end
end
I am now extremely frustrated with this code because no matter how many times I try, I still end up back at the same code (even with the hints and generous help of others).
I would be extremely grateful if somebody could PLEASE show me how to loop the word comparison count over each file so that it returns the values to a new column for each review.
clear all;
% Collects requested files from a specified folder and inserts them into an array
fpath = ('C:\Users\Willem\Documents\MATLAB\fold1');
% Returns an error if folder is not found
if ~isdir(fpath)
    errorMessage = sprintf('Error: The following folder does not exist:\n%s', fpath);
    uiwait(warndlg(errorMessage));
    return;
end
files = dir(fullfile(fpath,'*.oneline'));
nfiles = length(files);
docArray = {};
data = [];
% Separates each file's data strings into individual columns within the matrix
for k = 1:nfiles
    thisdata = importdata(fullfile(fpath,files(k).name)); % imports the data into the matrix array
    nrow = length(thisdata); % extend number of rows if needed
    docArray(1:nrow,end+1) = thisdata(:); % displays each review per column
    data = [data; thisdata]; % creates single column array of all the words
end
uniqueWords = unique(data); % Checks for unique words in all the review strings
% Counts the number of times each unique word appears in each review
wordCount = zeros(numel(uniqueWords),k);
for j = 1:length(uniqueWords)
    if max(uniqueWords{j} ~= ' ')
        for l = 1:length(docArray)
            if strcmp(docArray(l), uniqueWords{j})
                wordCount(j) = wordCount(j) +1;
            end
        end
    end
end
This is what I get so far.
Sorry in advance if asking this annoys anyone, but I have spent countless hours trying to get to grips with it, and as a MATLAB newbie it's doing my head in that I just don't know how to solve it.

Answers (3)

Cedric 2013-11-11
Edited: Cedric 2013-11-11
Here is an alternate and probably simpler solution (because it's a 1 line solution after you update the call to UNIQUE) for counting occurrences:
>> words = { 'john', 'jim', 'john', 'john', 'james', 'john', 'james' } ;
>> [uniqueWords,~,ic] = unique( words )
uniqueWords =
'james' 'jim' 'john'
ic =
3 2 3 3 1 3 1
>> counts = accumarray( ic(:), 1 )   % ic(:) forces a column, whichever shape UNIQUE returns
counts =
2
1
4

3 comments

Thank you. I can see the logic of your code, but again I have failed to implement it in my own code.
The poster would like to have a per-review count of each unique word.
The adapted code would probably use ismember() on each review (because not every review will have every unique word and order becomes important for the output.)
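One way the ismember() suggestion could look is sketched below. This is only an illustration on made-up data: the variable reviews and its contents are assumptions, standing in for the words imported from each file, and it is not the code either answerer had in mind.

```matlab
% Toy data standing in for the imported files (hypothetical values):
% one cell array of words per review.
reviews = { {'good','bad','good'}, {'bad','ugly'} };
uniqueWords = unique([reviews{:}]);            % sorted: bad, good, ugly

nWords = numel(uniqueWords);
wordCount = zeros(nWords, numel(reviews));
for k = 1:numel(reviews)
    % idx(i) is the position in uniqueWords of word i of review k
    [tf, idx] = ismember(reviews{k}, uniqueWords);
    rows = idx(tf);
    % Tally the matched positions; the size argument keeps words that
    % are absent from this review at zero
    wordCount(:, k) = accumarray(rows(:), 1, [nWords 1]);
end
disp(wordCount)
```

Each column of wordCount then holds the counts for one review, with rows in the same order as uniqueWords.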
Thanks to both of you, but even with your advice and hints I am still just going around in circles. I can only put it down to my not getting it; sorry for my stupidity. I hope I have not been a pain to both of you.

Hint:
thisreview = docArray(:,k);
if isequal(uniqueWords{j}, thisreview{l})

5 comments

Thank you, I will check it out and get back to you. Thanks again.
I have spent hours playing with the hint you gave but have still not succeeded in solving the issue. I will take a break now and see if I can solve it later. I feel a bit stupid, really, because I am betting it will turn out to be something really simple after all that. I appreciate the hint even if I don't know how to apply it.
There are multiple ways to proceed. The way that is closest to how you have set up your code at the moment is to loop over the unique words, and for each of them loop over the reviews, counting the number of times that word occurs in that review, and setting the entry at (word_number, review_number) appropriately.
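A minimal sketch of that word-by-review loop follows, on toy data rather than the real files (the contents of docArray here are assumptions). The key difference from the code in the question is the two-subscript assignment wordCount(j, k): a single subscript such as wordCount(j) is a linear index, which for j up to the number of unique words always lands in the first column.

```matlab
% Toy stand-in for docArray: one review per column, with shorter reviews
% padded by [] as happens when a cell array grows (hypothetical data).
docArray = { 'good', 'bad' ;
             'bad',  'ugly';
             'good', []    };
words = docArray(cellfun(@ischar, docArray));  % drop the empty padding cells
uniqueWords = unique(words);                   % sorted: bad, good, ugly

nWords   = numel(uniqueWords);
nReviews = size(docArray, 2);
wordCount = zeros(nWords, nReviews);
for j = 1:nWords
    for k = 1:nReviews
        thisreview = docArray(:, k);
        % strcmp tests the word against every cell of the column at once;
        % non-char padding cells compare as false
        wordCount(j, k) = sum(strcmp(uniqueWords{j}, thisreview));
    end
end
disp(wordCount)
```

Using strcmp on the whole column replaces the innermost loop over rows, so only two explicit loops remain.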
I see now. I will have another crack at it this evening, thank you.
I have made some progress but have still not found a successful solution. I can get the counts for selected files, or for the first or last file in the folder, but they all display in the first column. I am failing to find a way of showing all the files together in the matrix, each in its own column. It just seems to be beyond my abilities at present; I will keep trying or change method. Thank you for your time, Walter. You have been really helpful, but I dare not take up any more of your time on this issue for fear of annoying you.

I worked out the answer to the question with only minimal changes to my original code, for anyone who wishes to take note of it. Be warned: it loops through all the unique words (2400 of them), then through every row of the document array, and then through each column (160 columns), comparing the unique word with the word in each cell. Every match adds one to that word's count for that column in the word count matrix. This takes quite some time to complete, about 10-15 minutes in total. If anyone can think of a way to make this method more efficient, I would be happy to hear it: I have a much larger matrix to complete (4x larger), and I do not wish to spend this long waiting for word counts on a total of 800 files.
testCount = zeros(numel(trainTerms), a); % preallocates a zero matrix of the required size
for j = 1:numel(trainTerms)
    for l = 1:size(docArray1, 1)
        for ll = 1:size(docArray1, 2)
            if strcmp(docArray1{l, ll}, trainTerms{j})
                testCount(j, ll) = testCount(j, ll) + 1;
            end
        end
    end
end

8 comments

Hint:
[tf, idx] = ismember(CellString1, CellString2);
Then think about how you might use idx to do counting.
Do you mean something like this?
idx = strmatch('a', trainTerms, 'exact');
No, I mean ismember(). Look at the documentation for it, and see how it might help you vectorize.
If I understand correctly this is tf weighting and indexing. Am I correct?
Suppose you already knew for sure that everything in the first cellstring could be found in the second cellstring?
I have a list of 2400 unique words and tested it on a single review to see what happens. The idx output shows the index of each matched word and a 0 for any mismatch. The tf output shows a 1 for present and a 0 for absent, so matching the 1s to the relevant indices and adding them up would return a count per unique word. It appears fast, but I have only tested it on one review so far and have not yet done a count. I will get back to you on progress, thank you.
Suppose you switched around the two cellstrings?
I have discovered this, which takes the tf matches, pairs them with their corresponding indices, and then counts them automatically. It is quick, but I am not sure how best to loop it over every document without slowing it down.
stringCount(:,a) = cellfun(@(x) sum(ismember(thisDataFold1,x)), trainTerms)
I am able to run it through my current loop instead of strcmp(), but it takes just as long, so I am guessing my loops are what is causing the delays. It may be looping more than once for each check, judging by the results I get when I cut the process short (Ctrl+C) at different times (e.g. 5 minutes could show 3 columns and 2 minutes could show 33 columns). I have taken another look at the code but cannot see the error.
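For what it is worth, the hints about ismember() and swapping the two cellstrings seem to point toward something like the sketch below, which avoids looping over documents altogether. It is an assumption-laden illustration: docArray1 and trainTerms are filled with toy values, the padding cells are assumed to be [], and it has not been timed against the 160-file data set.

```matlab
% Toy stand-ins for the thread's docArray1 and trainTerms (hypothetical data).
docArray1  = { 'good', 'bad' ;
               'bad',  'ugly';
               'good', []    };
trainTerms = { 'bad'; 'good'; 'ugly'; 'unseen' };

filled = cellfun(@ischar, docArray1);     % true where a word is stored
[~, col] = find(filled);                  % review (column) of each stored word
words = docArray1(filled);                % same column-major order as find
% Swapped arguments: look each review word up in the term list, so row(i)
% is the term number of word i (0 when the word is not a training term)
[tf, row] = ismember(words(:), trainTerms);
col = col(:);

% One accumarray call fills the whole terms-by-reviews matrix
testCount = accumarray([row(tf), col(tf)], 1, ...
                       [numel(trainTerms), size(docArray1, 2)]);
disp(testCount)
```

Here each stored word contributes one (term, review) subscript pair, and accumarray adds up the pairs, so terms that never occur (such as 'unseen' above) simply keep a row of zeros.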

Asked: 2013-11-10

Commented: 2013-11-15
