Using Regexp to extract complete addresses

Question

Jordan Barrett 2021-4-6


Direct link to this question

https://ww2.mathworks.cn/matlabcentral/answers/794457-using-regexp-to-extract-complete-addresses

Answered: Abhinav Aravindan 2025-2-24


I'm trying to extract the addresses from a large PDF. Here is a screenshot of the PDF:

Here is the code I'm using:

str = extractFileText("document_name.pdf");
expression = '\d{1,8}\s\w*\s\w*\n';
startIndex = regexp(str,expression,'match');

However, this code only extracts addresses that begin with 1-8 digits, then have a space, then some letters, then a space, then more letters, then a new line.
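To illustrate the limitation, here is a minimal sketch that runs the same pattern on a few made-up lines. The sample addresses are hypothetical and are not taken from the PDF. On this sample, only the line with exactly two words after the number is returned.

```matlab
% Hypothetical sample lines standing in for the extracted PDF text.
sampleText = sprintf('1234 MAIN ST\n56 OAK\n789 OLD MILL RD\n');

% The pattern from the question: digits, whitespace, word, whitespace, word, newline.
expression = '\d{1,8}\s\w*\s\w*\n';
matches = regexp(sampleText, expression, 'match');

% Show each match without its trailing newline.
disp(strtrim(matches));
```

Because \s also matches a line break, this pattern can in principle run across two lines on other inputs, which is another reason to restrict the separator to spaces and tabs.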

As you can see in the screenshot, not all addresses are in this format. Some have numbers, a space, one word, then a new line; others have numbers followed by several words, then a new line; and so on. How can I extract every full address?


Answer 1

Abhinav Aravindan 2025-2-24


Direct link to this answer

https://ww2.mathworks.cn/matlabcentral/answers/794457-using-regexp-to-extract-complete-addresses#answer_1560455


Hi @Jordan Barrett,

From the screenshot, the addresses in your PDF appear to start with a 4-digit number followed by one or more words. Assuming each column in the screenshot is a page of the PDF, you can extract addresses matching this pattern by iterating over each line of the PDF text and applying a regular expression, as in the code below:

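% Read the text of the PDF
fileContent = extractFileText("document_name.pdf");
% Split the content into lines (one string per line)
textLines = splitlines(fileContent);
% Collect the matches in a cell array, one address per row
addresses = {};
% Four digits, one whitespace character, then uppercase letters and spaces
addressPattern = '\d{4}\s[A-Z\s]+';
% Extract addresses line by line
for i = 1:numel(textLines)
    % Convert to char so that regexp returns a cell array of matches
    textLine = strtrim(char(textLines(i)));
    matches = regexp(textLine, addressPattern, 'match');
    if ~isempty(matches)
        % A line can yield several matches; append them as rows
        addresses = [addresses; strtrim(matches(:))];
    end
end
disp('Extracted Addresses:');
disp(addresses);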
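If the numbers are not always four digits, or the words are not always uppercase, the pattern can be loosened. The following is only a sketch under assumptions: the sample text is made up, and it assumes every address sits on its own line, starts with 1 to 8 digits, and consists of words made of letters, digits, or underscores. It uses the 'lineanchors' option of regexp so that ^ and $ match at the start and end of each line, which avoids the explicit loop.

```matlab
% Hypothetical sample text standing in for the output of extractFileText.
sampleText = sprintf('1234 MAIN ST\n56 OAK\nPage 3\n789 OLD MILL RD\n');

% Start of line, 1-8 digits, then one or more words separated by spaces
% or tabs, up to the end of the line. [ \t] is used instead of \s so that
% a match cannot continue across a line break.
addressPattern = '^\d{1,8}(?:[ \t]+\w+)+[ \t]*$';

% 'lineanchors' makes ^ and $ match at the start and end of every line.
addresses = regexp(sampleText, addressPattern, 'match', 'lineanchors');

% Remove any trailing spaces and display one address per row.
addresses = strtrim(addresses);
disp(addresses(:));
```

Words containing punctuation (for example a trailing period or a hyphen) would not match \w+, so the character set would need to be widened for such data. If the extracted text uses carriage returns before the line feeds, those may also need to be stripped first.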

For more detail, see the documentation for regexp:

https://www.mathworks.com/help/matlab/ref/regexp.html
