Iteratively search in a website (for dummies)

Hi all,
I have a list of thousands of chemical formulas (or potential formulas). I'd like to loop over the list (for i=1:size(FormulaList,1) ... end), enter each formula into the search bar of the website ( https://pubchem.ncbi.nlm.nih.gov/ ), and check whether I get any matches or a message like "0 results found".
I've tried to apply the method described here ( https://it.mathworks.com/matlabcentral/answers/400522-retrieving-data-from-a-web-page ) but I was not able to understand how to get the "curl" (sorry: I'm completely ignorant in this!).
Cheers,
Luca
[SL: removed the parenthesis from the end of one of the hyperlinks]

Accepted Answer

Luca D'Angelo 2024-5-9
I've found the solution.
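% MassList: column vector of molecular formulas (string array or cell array of char vectors)
nFormulas = size(MassList,1);
ResNum = zeros(nFormulas,1);   % preallocate: number of matching compounds per formula
tic
for mass = 1:nFormulas
    % char() keeps the URL a char vector whether MassList is a string or a cell array
    url = ['https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/', ...
        char(MassList(mass,1)), '/cids/JSON?list_return=cachekey'];
    try
        jsonData = webread(url);
        ResNum(mass,1) = jsonData.IdentifierList.Size;
    catch
        ResNum(mass,1) = 0;   % any webread error is recorded as 0 results
    end
    pause(0.205) % the website asks for max 5 requests / second
end
toc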
The resulting column vector ResNum holds, for each formula, the number of PubChem compounds with that molecular formula.
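To turn the counts into the match / no-match answer asked for in the question, one option is to threshold ResNum. This is a minimal sketch, assuming MassList and ResNum are the column vectors from the loop above. The sample values and the names hasMatch, Summary and noHits are made up for illustration and are not real PubChem results.

```matlab
% Hypothetical sample data standing in for the real loop output.
MassList = ["C9H8O4"; "C6H12O6"; "C99H1"];   % formulas that were queried
ResNum   = [120; 85; 0];                      % made-up counts, not real PubChem results

hasMatch = ResNum > 0;                        % true where at least one compound was found
Summary  = table(MassList, ResNum, hasMatch, ...
    'VariableNames', {'Formula', 'NumCompounds', 'HasMatch'});

noHits = Summary.Formula(~Summary.HasMatch);  % formulas that came back with 0 results
disp(Summary)
fprintf('%d of %d formulas had no match.\n', numel(noHits), height(Summary));
```

Because the loop also records 0 whenever webread fails for any other reason (a timeout, for example), a 0 here is not proof that the formula is absent from PubChem.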

More Answers (1)

Steven Lord 2024-5-3
Your best bet is probably to use one of the access methods that PubChem provides, as described on this page. Note the usage policy. With thousands of requests it is likely to take minutes or longer; the bulk data download functionality linked in the usage policy may be a better fit for your needs.
From the MATLAB side of things, the functions in this documentation category will likely be of use to you, as may the functions on this documentation page. [Before you ask: no, I don't have any examples specific to using those functions to access that database.]
  3 Comments
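As a rough sense of scale, the minimum run time of a one-request-per-formula loop can be estimated from the rate limit. This sketch assumes the limit of 5 requests per second from PubChem's usage policy and a hypothetical list of 5000 formulas (the question only says "thousands").

```matlab
nFormulas    = 5000;   % hypothetical list size
maxPerSecond = 5;      % request rate limit, assumed from the usage policy
minSeconds   = nFormulas / maxPerSecond;   % lower bound, ignoring network latency
fprintf('At least %.1f minutes for %d requests.\n', minSeconds / 60, nFormulas);
```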
Steven Lord 2024-5-6
You haven't shown us what values you're using for the maxAttempts and waitTime variables in your code.
Luca D'Angelo 2024-5-6
opt=weboptions("Timeout",5);
molecularFormula = 'C9H8O4'; % Example molecular formula
apiUrl = sprintf('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/formula/%s/cids/JSON', molecularFormula);
maxAttempts = 10; % Maximum number of attempts
waitTime = 5; % Time to wait between attempts (in seconds)
attempt = 1;
while attempt <= maxAttempts
    jsonData = webread(apiUrl);
    if ~isfield(jsonData, 'Waiting') || isempty(jsonData.Waiting) %|| ~strcmpi(jsonData.Waiting, 'true')
        break; % Exit loop if request is not waiting anymore
    end
    attempt = attempt + 1;
    pause(waitTime);
end
% Check if the request is still processing after the loop
if isfield(jsonData, 'Waiting') && ~isempty(jsonData.Waiting) && strcmpi(jsonData.Waiting, 'true')
    disp('Your request is still processing. Please wait and try again later.');
    return;
end
if isfield(jsonData, 'Fault')
    disp(['Error: ', jsonData.Fault.Message]);
    return;
end
numResults = 0; % Initialize number of results
if isfield(jsonData, 'IdentifierList') && isfield(jsonData.IdentifierList, 'CID')
    numResults = numel(jsonData.IdentifierList.CID); % Number of search results
end
disp(['Number of results for molecular formula "', molecularFormula, '": ', num2str(numResults)]);
It doesn't really matter, actually. Most of the previous code was written by ChatGPT, but it's useless. The main lines are:
opt=weboptions("Timeout",5);
molecularFormula = 'C9H8O4'; % Example molecular formula
apiUrl = sprintf('https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/formula/%s/cids/JSON', molecularFormula);
jsonData = webread(apiUrl);
If webread worked, I might be able to find the information I'm looking for. The problem, I think, is that the function launches the search but doesn't wait for the website to 'load' the result, so it returns 'Your request is still running'. Maybe I should find a way to launch the request, wait, and then check whether the results have 'loaded'. What do you think?
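One possible explanation, offered as an assumption rather than a confirmed diagnosis: the formula endpoint answers asynchronously, and the first response carries a Waiting structure with a ListKey that has to be polled at a separate listkey URL, rather than by requesting the original URL again. Below is a minimal sketch under that assumption. pollFormulaCids and its fetch argument are hypothetical names, and the response field names (Waiting.ListKey, IdentifierList.CID) should be checked against the PUG REST documentation.

```matlab
function cids = pollFormulaCids(formula, fetch, maxAttempts, waitTime)
% pollFormulaCids  Hypothetical helper: start a formula search, then poll for it.
%   fetch is a function handle that takes a URL and returns decoded JSON,
%   for example @(u) webread(u, weboptions('Timeout', 30)).
    base = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound';
    data = fetch(sprintf('%s/formula/%s/cids/JSON', base, formula));
    attempt = 0;
    % Assumed response shape: a 'Waiting' struct with a 'ListKey' while the search runs.
    while isfield(data, 'Waiting') && attempt < maxAttempts
        pause(waitTime);
        listKey = char(string(data.Waiting.ListKey));
        data = fetch(sprintf('%s/listkey/%s/cids/JSON', base, listKey));
        attempt = attempt + 1;
    end
    if isfield(data, 'IdentifierList') && isfield(data.IdentifierList, 'CID')
        cids = data.IdentifierList.CID;   % compound IDs matching the formula
    else
        cids = [];   % still waiting, a fault, or no matches
    end
end
```

A call might look like cids = pollFormulaCids('C9H8O4', @(u) webread(u, weboptions('Timeout', 30)), 10, 2); followed by numel(cids) for the number of results. Passing the fetch function in as a handle also lets the polling logic be tried with a stub before sending real requests.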

Release

R2023a