Image extraction from webpage
b
2020-4-27
There are serial-numbered webpages (some of these numbers don't exist), which have images of interest at one particular location in the html file:
<h4 id="COMPANY">COMPANY</h4>
<p><img class="image" border="0" src="/resources/companyName_company.jpg"/></p>
The companyName is different in each numbered webpage.
However, urlwrite saves only the HTML pages, without these images. When a saved page is opened in a browser, the images are absent. Since only these images are of interest, and none of the other content of the webpage, this defeats the whole purpose. How can this be resolved? Is there a way to get only these images, and nothing else, from the webpage?
2 Comments
b
2020-4-27
No, the html does contain these lines. But when opened in browser, there is no image. The heading in between the <h4></h4> appears correctly. The image part, which should be just below it, does not appear.
Everything else on the webpage is unneeded information. Unable to figure out how to filter that out and extract only this image part.
This structure is unique in the HTML pages. Every numbered HTML page has a <heading> immediately followed by an <image>.
Accepted Answer
Rik
2020-4-27
The HTML file doesn't contain the image. It contains a relative path to the image. Because you don't have the image file in the location the HTML file specifies, the image doesn't show up. You need to use the 3 step process below to get the image file.
- download the HTML file
- determine the relative image path ('/resources/companyName_company.jpg')
- download the image from website.com/resources/companyName_company.jpg
18 Comments
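The three steps above could be sketched roughly as follows. This is only an illustration under assumptions: the domain is the placeholder used later in this thread, the two HTML lines from the question stand in for a downloaded page, and regexp is used here for brevity instead of the strfind approach discussed in the comments. It picks up the first src attribute it finds, which may not be the right one on a real page.

```matlab
% Sketch only: base_url is the placeholder domain from the thread.
base_url = 'https://companyNameWebsite.org';

% Step 1 (needs network access):
%   html = webread([base_url '/2?outline=by_category']);
% Here the snippet from the question stands in for the downloaded page.
html = ['<h4 id="COMPANY">COMPANY</h4>' char(10) ...
        '<p><img class="image" border="0" src="/resources/companyName_company.jpg"/></p>'];

% Step 2: capture whatever sits between src=" and the next double quote.
tok = regexp(html, 'src="([^"]+)"', 'tokens', 'once');
partial_url = tok{1};

% Step 3: the path is relative to the site root, so prepend the domain.
img_url = [base_url partial_url];
disp(img_url)
% websave('company2.jpg', img_url);   % uncomment to download the image
```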
Rik
2020-4-27
If the text around that is the same every time it should not be too difficult to write some code that will find it. You don't even need to store the HTML file, you can leave it as a char array if you use webread (or urlread on older releases).
Rik
2020-4-28
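If you remove all whitespace and newlines (so chars 32, 13, and 10) you should be able to search for '</h4><p><img class' (after stripping char 32 the space inside the tag is gone as well, so the text to look for is '</h4><p><imgclass'). You said this structure is unique, so you should be able to use it to find the URL you need.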
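A minimal sketch of that idea, run on the two HTML lines from the question rather than on a downloaded page. The variable names are made up for illustration, and the sketch assumes the image path itself contains no spaces, since those would be stripped too.

```matlab
% Stand-in for the page text; sprintf turns \r\n into real line breaks.
html = sprintf(['<h4 id="COMPANY">COMPANY</h4>\r\n' ...
    '  <p><img class="image" border="0" src="/resources/companyName_company.jpg"/></p>']);

% Remove spaces (32), carriage returns (13) and line feeds (10).
clean = html;
for c = [32 13 10]
    clean = strrep(clean, char(c), '');
end

% The space in '<img class' is gone now, so search without it.
marker = '</h4><p><imgclass';
pos = strfind(clean, marker);
rest = clean(pos(1):end);

% The URL runs from just after src=" up to the next double quote.
s = strfind(rest, 'src="');
s = s(1) + 5;
q = strfind(rest(s:end), '"');
partial_url = rest(s:s+q(1)-2);
disp(partial_url)
```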
Rik
2020-4-28
Where would you start? I wasn't born knowing Matlab, so you can learn it too. How can you remove specific characters from a char array? (hint: strrep) How can you find the position of a specific string in an array? (hint: strfind)
You didn't share any details about the rest of the HTML document, so you will have to do this on your own. Show what you try and try to explain why it fails. That makes it easier to guide you.
Rik
2020-4-28
This is what I did:
for n=1:9
%for n=2:9
%for n=1:1000
    %n=0 and 1 don't exist. For those that don't exist, it gives the error:
    %Error using urlreadwrite (line 98)
    %Error downloading URL. Your network connection may be down or your proxy settings improperly configured.
    %Error in urlwrite (line 52)
    %[f,status] = urlreadwrite(mfilename,catchErrors,url,filename,varargin{:});
    try
        urlwrite(sprintf('https://companyNameWebsite.org/%i?outline=by_category',n), ...
            sprintf('company%i.html',n));
        n
    catch
        %Nothing to do if n-value company webpage doesn't exist & error shows
    end
end
How to get from here to getting images at
because the companyName_company are not numerical. Even in this term, only the 'companyName' varies, while the '_company' is the same.
Rik
2020-4-28
Step by step. First read the contents of the web page as a char array, then extract the image url.
You should also first do it for one page, then proceed to process all.
for n=2%1:1000
    %read HTML
    url=sprintf('https://companyNameWebsite.org/%i?outline=by_category',n);
    try
        data=webread(url);
    catch ME
        %check if the error is what you expect for a non-existent page
        if ~WhatYouExpect(ME)
            rethrow(ME)
        else
            continue%go to next iteration
        end
    end
    %now use strrep and strfind to find '</h4><p><img class'
    %(this step should set partial_url)
    %store the image
    img_url=sprintf('%s%s','https://companyNameWebsite.org',partial_url);
    websave(___)
end
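WhatYouExpect in the code above is a placeholder, not an existing function. One possible way to fill it in is sketched below, written as an anonymous function so it runs as a plain script. The check for 'HTTP404' in the error identifier is an assumption about how webread reports a missing page; inspect ME.identifier from a real failed call on your release before relying on it.

```matlab
% Hypothetical stand-in for WhatYouExpect: true when the error looks like
% "page not found". The identifier text is an assumption; confirm it by
% looking at ME.identifier after a real failed webread call.
WhatYouExpect = @(ME) ~isempty(strfind(ME.identifier, 'HTTP404'));

% Simulated errors, so the sketch runs without network access.
missing = MException('MATLAB:webservices:HTTP404StatusCodeError', 'Not Found');
other   = MException('MATLAB:webservices:Timeout', 'Timed out');

disp(WhatYouExpect(missing))
disp(WhatYouExpect(other))
```

With a check like this, a missing page number is skipped, while any other failure (for example a dropped connection) is still rethrown instead of being silently ignored.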
b
2020-4-28
Even after sharing what I have done, there is no help from your end. If I had that much MATLAB knowledge, I wouldn't be undergoing this humiliation of some 'maestro' sitting at the helm and giving guidance as if to a school kid. If I had to learn on my own, I wouldn't be posting it here, right? Please do not respond to this or any of my other questions any further.
Star Strider
2020-4-28
b — This is not a trivial problem. There are no straightforward, general solutions.
Rik
2020-4-28
I didn't mean to come across as humiliating you. As Star Strider mentions: this isn't simple. Difficult problems should be cut down to smaller, solvable problems. I can't solve your question all at once, I can only describe what steps you need to take.
Since you don't share the HTML itself I can't help you with specific code. I did try to help you. You shared some code, in which you were using some functions; I posted code with different functions and a slightly different structure. I'm fine with it if you don't want any further help from me, but if you only want that because you feel I'm belittling you, I want you to know that is not my intention.
Intentions are difficult to judge on the internet, because either or both may not be native speakers of the language they use to communicate. Even if both are, there can still exist a socio-cultural difference. And of course it is difficult to convey tone in text.
b
2020-4-28
What difference does an example URL make? How can I share HTML? I see no bearing of the actual HTML on this problem. How can you work in the software field/industry and be unaware of Confidentiality Clauses and Agreements? You say that you posted code with different functions and a slightly different structure, but apart from one commented line, can you tell me what is new in your code that is not already there in the example m file? How can you frustrate people who are already grappling with problems, let alone programming skills?
Rik
2020-4-28
The HTML has everything to do with it. If you want explicit help we will need explicit data. An explicit example will allow me to write some code that will help you read the image URL. (side note: you never mentioned an NDA before, so why would you assume I don't know about those? And are you sure you are even allowed by that NDA to post this question?) Do you think it is a smart move to tell people they frustrate you when you are the one asking for help?
Your code has a fundamentally different structure than mine, but we don't have to argue about that.
You can leave the WhatYouExpect function blank if you like, but if you want help on the part with %now use strrep and strfind to find '</h4><p><img class', you will have to provide me with an example file. It doesn't have to be an actual file. It just has to be real enough for a parser function to work and to show you how you can improve/alter it so it works on the real files.
Rik
2020-4-29
Despite what you said, the pattern you mentioned is not unique. Below is my guess for your pattern. Modify as needed.
for n=2%1:1000
    %read HTML
    %url=sprintf('https://companyNameWebsite.org/%i?outline=by_category',n);
    url='https://www.mathworks.com/matlabcentral/answers/uploaded_files/288498/company3a.txt';
    try
        data=webread(url);
    catch ME
        %check if the error is what you expect for a non-existent page
        if ~WhatYouExpect(ME)
            rethrow(ME)
        else
            continue%go to next iteration
        end
    end
    %split at every <h4 tag, so each cell starts with the tag's attributes
    t=strsplit(data,'<h4');
    %the comparison below is case-sensitive; adjust the pattern to your pages
    pattern=' id="company"';numel_pattern=numel(pattern);
    partial_url='';%set a default in case of failure
    for k=1:numel(t)
        try
            if strcmp(t{k}(1:numel_pattern),pattern)
                %first character after src="
                ind1=strfind(t{k},'src="')+4;ind1=ind1(1)+1;
                %last character before the closing quote
                ind2=strfind(t{k},'"');
                ind2=ind2(ind2>ind1);ind2=ind2(1)-1;
                partial_url=t{k}(ind1:ind2);
            end
        catch
            %line too short, or url reading failed
        end
    end
    if isempty(partial_url)
        %Should the code throw an error here? Warn? Simply continue?
    end
    %store the image
    img_url=sprintf('%s%s','https://companyNameWebsite.org',partial_url);
    websave(___)
end
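The websave call above is left open. One way the storing step might look is sketched here, assuming the local file should simply reuse the remote file name; base_url, out_file and the example partial_url are illustrative values, and the download itself only succeeds with network access and the real domain.

```matlab
base_url    = 'https://companyNameWebsite.org';      % placeholder domain from the thread
partial_url = '/resources/companyName_company.jpg';  % example of an extracted path
img_url = sprintf('%s%s', base_url, partial_url);

% Reuse the remote file name for the local copy.
[~, name, ext] = fileparts(partial_url);
out_file = [name ext];

try
    websave(out_file, img_url);   % needs network access and a real domain
catch ME
    warning('Download of %s failed: %s', img_url, ME.message);
end
```

Wrapping the download in try/catch keeps a single failed image from stopping a loop over many pages.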
Rik
2020-4-29
Glad to be of help.
Since you suggested that an NDA prevents you from providing more details, I don't see what adding "(subject to testing)" is trying to accomplish. Obviously it works on a recent release of Matlab for this example, otherwise I wouldn't have posted it. The only thing it currently accomplishes is sounding condescending.
b
2020-4-30
This is working well. I can clearly see now how strsplit, strfind and strcmp can be used. After experimenting with this code on a few different configurations, the run-time is also reasonable: a few hours for 1000 cases. Another observation is that the file size of the retrieved image is exactly the same as that of the original file. This may not be surprising to an experienced coder, but the code could be modified to bring down the run-time as well as the disk space, so that the user gets an option to vary the file size of the retrieved file. If the original image file is 4MB, but maybe only 60kB suffices, then that is a reduction by ~70 times. This will translate to an almost equivalent reduction in the run-time and surely the same amount of reduction in disk space. Instead of 4GB of space, only 60MB will be used. The trick will be in the amount of processing time taken by the dimension- or size-reducing algorithm.
But that goes beyond the purview of this question thread.
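For what it is worth, a rough sketch of shrinking an image after it has been saved is given below. It uses a synthetic image and made-up file names, and it keeps every 4th pixel using only base MATLAB; imresize from the Image Processing Toolbox would likely give a smoother result. Note that this only saves disk space: the full-size file still has to be downloaded first, so the download time would not shrink unless the site itself offers smaller versions.

```matlab
% Synthetic stand-in for a downloaded photo (random pixels compress poorly).
img = uint8(randi([0 255], 800, 1200, 3));
imwrite(img, 'original.jpg', 'Quality', 95);

% Keep every 4th pixel in each direction and use a lower JPEG quality.
small = img(1:4:end, 1:4:end, :);
imwrite(small, 'reduced.jpg', 'Quality', 75);

% Compare the sizes on disk.
o = dir('original.jpg');
r = dir('reduced.jpg');
fprintf('original: %d bytes, reduced: %d bytes\n', o.bytes, r.bytes);
```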
More Answers (0)