- Every word ends with a space
- Every line ending has a carriage return and line feed
How can I get the word count of each line from an extracted PDF file
Hi, I extracted text from a PDF file with many lines/entries of comments. I want to get the word count of each line, the average word count across all lines, and the number of lines that have only one word. Is this possible? Thanks!!
Answers (1)
Kiran Felix Robert
2021-2-2
Hi Yao,
I assume that you have extracted the text from a PDF file and saved it as a string variable. You can convert the string to a character array (convertStringsToChars) and then count the words and lines character by character.
Assume that every word ends with a space and that every line ends with a carriage return and line feed.
Using the built-in MATLAB example file exampleSonnets.pdf, the following program gives the total line count and word count for one section of the file.
% Read the example PDF (extractFileText is part of Text Analytics Toolbox)
str = extractFileText("exampleSonnets.pdf");
% Locate the section between the headings "II" and "III"
ii = strfind(str,"II");
iii = strfind(str,"III");
start = ii(1);
fin = iii(1);
stringText = extractBetween(str,start,fin-1);
B = convertStringsToChars(stringText);
% Define the space character and end-of-line character.
% These are picked by position from this particular sample; for other
% text, set them explicitly, e.g. ' ' and char(13).
SpaceCharacter = B(3);
CarriageReturnCharacter = B(4);
lineCount = 0;
wordCount = 0;
i = 1;
while i <= length(B)
    if B(i) == CarriageReturnCharacter
        lineCount = lineCount + 1; % Total line count
    end
    if B(i) == SpaceCharacter
        wordCount = wordCount + 1; % Total word count
    end
    i = i + 1;
end
Kiran
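The loop in the answer reports totals only. For the per-line figures asked about in the question (words per line, the average, and the number of one-word lines), one possible approach is sketched below. It is not part of the original answer. The sample text `txt` and the variable names are hypothetical stand-ins for the extracted PDF text, and the sketch assumes that words are separated by whitespace and that blank lines should be ignored.

```matlab
% Hypothetical sample standing in for the text extracted from the PDF
txt = "Great product" + newline + "Thanks" + newline + newline + "Would buy this again";

lines = splitlines(txt);              % one string per line
lines = strip(lines);                 % trim leading/trailing whitespace
lines = lines(strlength(lines) > 0);  % assumption: blank lines are ignored

% Words per line: count the runs of non-whitespace characters in each line
wordsPerLine = arrayfun(@(s) numel(regexp(s, '\S+', 'match')), lines);

avgWords     = mean(wordsPerLine);      % average word count over all lines
oneWordLines = sum(wordsPerLine == 1);  % number of lines with exactly one word
```

Counting runs of non-whitespace characters avoids the requirement that every word ends with a space, so the last word on a line is still counted. If blank lines should count as zero-word lines instead, the filtering step can be dropped.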