How can I remove websites' links from a text?
I am trying to remove websites' links from a string. I would like to remove (or replace with a space ' ') every link that starts with 'https:'. I tried using the command regexprep, but I am able to replace only a specific link.
1 Comment
Jan
2017-2-1
Please post some relevant part of the text. Is the "https:" included in < and > or in double quotes? Can spaces appear in the links?
Answers (2)
Iddo Weiner
2017-2-1
Edited: Iddo Weiner
2017-2-1
Dario, this really depends on what your data looks like. BUT I made an assumption regarding what your text might look like, please check out the following method:
text = 'some words https:link some other words https:otherlink final words';
disp(text)
some words https:link some other words https:otherlink final words
text_copy = text; % work on a copy so you always have the original for comparison
base_string = 'https:';
first_del_idx = strfind(text, base_string); %this is where the link string starts
% find the paired last index for each first index
last_del_idx = nan(size(first_del_idx));
for i = length(first_del_idx):-1:1 % the loop works "backwards" so earlier indices stay valid
    next_idx = first_del_idx(i) + length(base_string); % no point in checking before this point
    % advance until a space, a backslash (guard against a link at the end of a line),
    % or the end of the text, so the index never runs past the last character
    while next_idx <= length(text_copy) && ~ismember(text_copy(next_idx), ' \')
        next_idx = next_idx + 1;
    end
    last_del_idx(i) = min(next_idx, length(text_copy));
    text_copy(first_del_idx(i) : last_del_idx(i)) = []; % this is the actual deletion
end
% let's see what we're left with
disp(text_copy)
some words some other words final words
Explanation: You might need to adjust a few things in the code, so here's the logic. I assumed you have a base string that can be used to find all link occurrences. I also assumed that links are written without spaces and that a space marks the end of a link, so if you start scanning from "https:" and stop when you hit a space (' '), you have found the full length of the substring to delete. If that is not the case, you will need a different identifier for the end of a link, maybe '.com' or '/'. I can't know this for sure without seeing your data.
There is at least one edge case I could think of that could cause bugs in my code: what if the link is at the end of a line? In that case, instead of ending with a space, it would end with a backslash '\' that is part of a \n marking the start of a new line, so I added a condition to protect against this. Note that this only applies if your text contains the literal two characters \ and n; a real newline character (char(10)) is not a backslash and would need its own check. Then again, your data may not have \n at the end of lines, and then we'd have to think of a different identifier for these cases.
There are some principles I highlighted here that might be a little confusing: working with a copy (and not on the original data) is good coding practice. I'd also recommend traversing the string backwards, so that erasing doesn't shift the indices you still need, which can cause all kinds of unwanted bugs.
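To illustrate the point about traversing backwards, here is a minimal sketch. The sample string and the variable names (s, idx, s_back, s_fwd) are made up for illustration and are not from the question. The indices are computed once on the original string, then the two-character marker 'XX' is deleted in both directions:

```matlab
s = 'ab XX cd XX ef';            % hypothetical sample text
idx = strfind(s, 'XX');          % idx = [4 10], computed once on the original

% Backwards: deleting the last match first leaves earlier indices valid
s_back = s;
for k = numel(idx):-1:1
    s_back(idx(k):idx(k)+1) = [];
end
disp(s_back)                     % ab  cd  ef

% Forwards: after the first deletion, idx(2) points at the wrong characters
s_fwd = s;
for k = 1:numel(idx)
    s_fwd(idx(k):idx(k)+1) = [];
end
disp(s_fwd)                      % ab  cd XXf
```

The forward loop removes ' e' instead of the second 'XX', because everything after the first deletion has shifted two positions to the left.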
I hope this helps
P.S. I worked here with strfind(), but you could substitute a regular-expression function such as regexp() if you prefer. It's essentially the same in this case.
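As a rough sketch of that substitution, regexp() can return the start and end index of every link in one call, which replaces the manual scan for the closing space. This assumes, as above, that links contain no whitespace. The pattern 'https:\S*\s?' (the link plus at most one trailing whitespace character) and the names first_idx and last_idx are my own choices for illustration:

```matlab
text = 'some words https:link some other words https:otherlink final words';

% 'start' and 'end' make regexp return the first and last index of each match
[first_idx, last_idx] = regexp(text, 'https:\S*\s?', 'start', 'end');

text_copy = text;                 % work on a copy, as before
for i = numel(first_idx):-1:1     % backwards, so earlier indices stay valid
    text_copy(first_idx(i):last_idx(i)) = [];
end
disp(text_copy)                   % some words some other words final words
```

Because \S* also stops at the end of the text, a link in the very last position is handled without a separate check.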
Christopher Creutzig
2017-11-2
Based on your description, the following should work. It uses \S*, the regex notation for “arbitrarily many non-whitespace characters”:
regexprep(str,'https:\S*','')
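A short usage sketch, assuming the same kind of sample text as in the other answer (the variable names str and cleaned are illustrative). Removing a link leaves the spaces on both sides of it in place, so an optional second pass collapses runs of spaces:

```matlab
str = 'some words https:link some other words https:otherlink final words';

cleaned = regexprep(str, 'https:\S*', '');   % remove every link
cleaned = regexprep(cleaned, ' +', ' ');     % optional: collapse the doubled spaces left behind
cleaned = strtrim(cleaned);                  % optional: drop leading/trailing whitespace
disp(cleaned)                                % some words some other words final words
```

To replace each link with a space instead of deleting it, as the question also suggests, pass ' ' as the third argument of the first call.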