word count matrix problem

Question

0 votes

Can anyone see how I can correct this code for the wordCount matrix? I am counting the unique words across all the files, and I have a 2469 (unique words) x 160 (reviews) matrix. I have attached a snippet of the matrix for preview.

The problem I am having is that I am completely stuck on how to allocate the word counts to the review they belong to. What happens instead is that the total count appears in the first column and the rest are zero. I would very much appreciate it if someone could have a look at my code and see if they can find the problem. It is probably a really stupid error, but I just cannot see it. I have tried loads of methods to solve it, and this appears to be the best one so far (for me at least).

clear all;
% Collects requested files from a specified folder and inserts them into an array
fpath = ('C:\Users\Willem\Documents\MATLAB\fold1');
% Returns an error if the folder is not found
if ~isdir(fpath)
    errorMessage = sprintf('Error: The following folder does not exist:\n%s', fpath);
    uiwait(warndlg(errorMessage));
    return;
end
files = dir(fullfile(fpath,'*.oneline'));
nfiles = length(files);
data = {};
docArray = {};
if true
    data = [];
    % Separates each file's data strings into individual columns within the matrix
    for k = 1:nfiles
        thisdata = importdata(fullfile(fpath,files(k).name)); % imports the data into the matrix array
        nrow = length(thisdata); % extend number of rows if needed
        docArray(1:nrow,end+1) = thisdata(:); % displays each review per column
        data = [data; importdata(fullfile(fpath,files(k).name))]; % creates single column array of all the words
    end
end
uniqueWords = unique(data); % Checks for unique words in all the review strings
% Counts the number of times each unique word appears in each review
wordCount = zeros(numel(uniqueWords),k);
for j = 1:length(uniqueWords)
    counter = 0;
    for l = 1:length(data)
        if isequal(uniqueWords{j},data{l})
            counter = counter +1;
        end
    end
    wordCount(j) = counter;
end

7 comments

Willem 2013-11-11

I have, however, found a way of counting the first file into column 1, but I am still unable to find a way of looping it so that each of the other files goes into its own column, up to k = 160. Code below (changed data for docArray):

if true
  % code
uniqueWords = unique(data); % Checks for unique words in all the review strings
% Counts the number of times each unique word appears in each review
wordCount = zeros(numel(uniqueWords),k);
thisreview = docArray(:,k);
for j = 1:length(uniqueWords)
  counter = 0;
  for l = 1:length(docArray)
      if isequal(uniqueWords{j}, docArray{l})
          counter = counter +1;
      end
  end
  wordCount(j) = counter;
end
end

Willem 2013-11-12

I am now extremely frustrated with this code because, no matter how many times I try, I still end up back at the same code (even with the hints and generous help of others).

I would be extremely grateful if somebody could PLEASE show me how to loop the word comparison count for each file so it then returns the values to a new column for each review.

clear all;
% Collects requested files from a specified folder and inserts them into an array
fpath = ('C:\Users\Willem\Documents\MATLAB\fold1');
% Returns an error if folder is not found
if ~isdir(fpath)
    errorMessage = sprintf('Error: The following folder does not exist:\n%s', fpath);
    uiwait(warndlg(errorMessage));
    return;
end
files = dir(fullfile(fpath,'*.oneline'));
nfiles = length(files);
docArray = {};
data = [];
% Separates each file's data strings into individual columns within the matrix
for k = 1:nfiles
    thisdata = importdata(fullfile(fpath,files(k).name)); % imports the data into the matrix array
    nrow = length(thisdata);  % extend number of rows if needed
    docArray(1:nrow,end+1) = thisdata(:); % displays each review per column
    data = [data; thisdata]; % creates single column array of all the words
end
uniqueWords = unique(data); % Checks for unique words in all the review strings
% Counts the number of times each unique word appears in each review
wordCount = zeros(numel(uniqueWords),k);
for j = 1:length(uniqueWords)
    if max(uniqueWords{j} ~= ' ')
        for l = 1:length(docArray)
            if strcmp(docArray(l), uniqueWords{j})
                wordCount(j) = wordCount(j) +1;
            end
        end
    end
end

This is what I get so far.

Sorry in advance if asking this annoys anyone, but I have spent countless hours trying to get to grips with it, and as a MATLAB newbie it's doing my head in that I just don't know how to solve it.

Answer 1

Cedric 2013-11-11

Edited: Cedric 2013-11-11

1 vote

Here is an alternate and probably simpler solution (because it's a 1 line solution after you update the call to UNIQUE) for counting occurrences:

 >> words = { 'john', 'jim', 'john', 'john', 'james', 'john', 'james' } ;
 >> [uniqueWords,~,ic] = unique( words )
 uniqueWords = 
    'james'    'jim'    'john'
 ic =
     3     2     3     3     1     3     1
 >> counts = accumarray( ic.', ones(size(ic)) )
 counts =
     2
     1
     4
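
A note on portability: depending on the MATLAB release, unique may return ic as either a row or a column vector, so the transpose above may not always hand accumarray the column of subscripts it expects. A minimal sketch that should behave the same either way, assuming the same toy words list, forces a column with ic(:) and adds a scalar 1 per occurrence:

```matlab
% Same toy list as in the session above.
words = { 'john', 'jim', 'john', 'john', 'james', 'john', 'james' };

% ic maps every entry of words to its position in uniqueWords.
[uniqueWords, ~, ic] = unique(words);

% ic(:) is a column regardless of how unique oriented ic;
% the scalar 1 is accumulated once per occurrence.
counts = accumarray(ic(:), 1);

disp(uniqueWords)
disp(counts)
```

Here counts(i) is the number of times uniqueWords{i} occurs in words.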

3 comments

Walter Roberson 2013-11-12

The poster would like to have a per-review count of each unique word.

The adapted code would probably use ismember() on each review (because not every review will have every unique word and order becomes important for the output.)
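
One way the ismember() suggestion might look is sketched below. This is only an illustration on made-up data: the small docArray stands in for the real one (one review per column, shorter reviews padded with empty cells), and the variable names isWord, thisReview and loc are hypothetical.

```matlab
% Toy stand-in for docArray: one review per column, padded with [].
docArray = { 'good', 'bad',  'good' ;
             'film', 'film', 'good' ;
             'good', [],     []     };

% Vocabulary across all reviews; padding cells are skipped.
isWord      = cellfun(@ischar, docArray);
uniqueWords = unique(docArray(isWord));

nReviews  = size(docArray, 2);
wordCount = zeros(numel(uniqueWords), nReviews);

for k = 1:nReviews
    % Words of review k only, without the padding cells.
    thisReview = docArray(isWord(:, k), k);
    % loc(i) is the row of uniqueWords that matches thisReview{i}.
    [~, loc] = ismember(thisReview, uniqueWords);
    % Tally the row indices; the size argument keeps one row per
    % unique word even when a word is absent from this review.
    wordCount(:, k) = accumarray(loc(:), 1, [numel(uniqueWords), 1]);
end

disp(wordCount)
```

Because every count is written to row loc and column k, the row order of wordCount always follows uniqueWords, whichever words a given review happens to contain.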

Willem 2013-11-12

Thanks to both of you, but even with your advice and hints I am still just going around in circles. I can only put it down to the fact that I don't get it; sorry for my stupidity. I hope I have not been a pain to either of you.

Answer 2

Walter Roberson 2013-11-10

0 votes

Hint:

    thisreview = docArray(:,k);
    if isequal(uniqueWords{j}, thisreview{L})
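
Read in context, the hint amounts to adding an outer loop over the reviews and storing each count at row j, column k. The sketch below is one possible reading of it, not necessarily the exact code intended; the toy docArray and uniqueWords are made up, and the index r plays the role of L in the hint.

```matlab
% Toy stand-ins for the variables in the question (hypothetical data).
docArray    = { 'good', 'bad'  ;
                'film', 'film' ;
                'good', []     };      % review 2 is shorter, so it is padded
uniqueWords = { 'bad'; 'film'; 'good' };

nReviews  = size(docArray, 2);
wordCount = zeros(numel(uniqueWords), nReviews);

for k = 1:nReviews                     % one column of wordCount per review
    thisreview = docArray(:, k);
    for j = 1:numel(uniqueWords)
        counter = 0;
        for r = 1:numel(thisreview)    % r plays the role of L in the hint
            % isequal is false for the empty padding cells.
            if isequal(uniqueWords{j}, thisreview{r})
                counter = counter + 1;
            end
        end
        wordCount(j, k) = counter;     % row j, column k (not wordCount(j))
    end
end

disp(wordCount)
```

The single-index assignment wordCount(j) only ever touches the first column, which matches the symptom described in the question.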

5 comments

Willem 2013-11-11

I see now. I will have another crack at it this evening, thank you.

Willem 2013-11-11

I have made some progress but am still not finding a successful solution. I can find the counts for selected files, or for the first or last file in the folder, and they all display in the first column. But I am failing to find a way of showing all the files together in the matrix, each in its own column. It just seems to be beyond my abilities at present; I will keep trying or change method. Thank you for your time, Walter. You have been really helpful, but I dare not take up any more of your time on this issue for fear of annoying you.

Answer 3

Willem 2013-11-14

0 votes

I worked out the answer to the question with only minimal changes to my original code, for anyone who wishes to take note of it. Be warned: it loops first through all the unique words (2400 of them) and then through every row and column of the document array (160 columns), comparing each unique word with each word in the array. Whenever a match is found, it increments the count for that word in that column of the word count matrix. This takes quite some time to complete, about 10-15 minutes in total. If anyone can think of a way to make this method more efficient, I would be happy to hear it, as I have a much larger matrix to complete (4x larger) and do not wish to spend this length of time waiting for the word counts to be computed for a total of 800 files.

    testCount = zeros(numel(trainTerms), a); % preallocates a zero matrix of the required size
    for j = 1:numel(trainTerms)
        for l = 1:size(docArray1, 1)
            for ll = 1:size(docArray1, 2)
                if strcmp(docArray1{l, ll}, trainTerms{j})
                    testCount(j, ll) = testCount(j, ll) + 1;
                end
            end
        end
    end
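
On the efficiency question, one option that avoids the three nested loops is to look up every word once and let accumarray build the whole matrix. The following is a minimal sketch under assumptions, not a tested drop-in replacement: docArray1 and trainTerms are filled with made-up data here, and isWord, words, row and col are hypothetical helper names.

```matlab
% Toy stand-ins (hypothetical data): two documents, padded with [].
docArray1  = { 'good', 'bad' ;
               'film', 'odd' ;
               'good', []    };
trainTerms = { 'bad'; 'film'; 'good'; 'great' };

% Keep only the cells that hold a word, and remember which
% column (document) each one came from. Both the logical
% indexing and find() walk the array in column-major order.
isWord   = cellfun(@ischar, docArray1);
words    = docArray1(isWord);
[~, col] = find(isWord);

% One lookup for all words: row(i) is the index of words{i} in
% trainTerms, and tf(i) is false for words not in the term list.
[tf, row] = ismember(words, trainTerms);
row = row(:);
col = col(:);

% Add 1 at (term, document) for every matched word.
testCount = accumarray([row(tf), col(tf)], 1, ...
                       [numel(trainTerms), size(docArray1, 2)]);

disp(testCount)
```

The work is then done by one ismember call and one accumarray call rather than a string comparison per term per cell, which would be expected to scale better, although actual timings will depend on the data and the MATLAB release.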

8 comments

Walter Roberson 2013-11-14

Suppose you switched around the two cellstrings?

Willem 2013-11-15

I have discovered this, which takes the tf counts, matches them with their corresponding indexes, and then counts them automatically. It is quick, but I am not sure how best to loop it over every document without slowing it down.

stringCount(:,a) = cellfun(@(x) sum(ismember(thisDataFold1,x)), trainTerms)

I am able to run it through my current loop instead of strcmp(), but it takes just as long, so I am guessing my loops are what is causing the delays. I suspect it may be looping more than once for each check, judging by the results I receive when I cut the process short (Ctrl+C) at different times (e.g. 5 minutes could show 3 columns and 2 minutes could show 33 columns). I have tried to take another look at the code but cannot see the error.
