How to extract data from a long set of strings and put it into one cell/array/matrix?
Dear Colleagues,
I have been stuck on this for too long and need your help. The problem:
I crawled a website with search queries and wrote the response to each request into a string in MATLAB. As a result I have a data folder containing ~1000 .mat files.
The content of one string looks like the example at the end of this post. I want to go through all the data and extract the information given by
- dc:identifier
- dc:title
- dc:creator
- prism:publicationName
- prism:coverDate
- prism:coverDisplayDate
- prism:doi
- citedby-count
- prism:aggregationType
from the strings. That is, I want to search for the dc:identifier entry, extract the data after that entry, delete all ", _, ' and similar characters, and put the information into a cell array/matrix.
Each of the ~1000 strings usually contains more than one dataset (typically 200). I would therefore like to collect all the data in, perhaps, a cell array where the head column holds "dc:identifier" etc. and each following column contains one dataset, ending up with ~147,000 datasets in one array or cell array.
So far I have tried strsplit and regex, but I have reached the limit of my MATLAB knowledge.
Another attempt is to put the following into a big for loop, read one string after another, and try to get the data out:
somestring = COMPdata2010res7;
underscore_indices = strfind(somestring,'"dc:title":"');
fs_indices = strfind(somestring,'creator"');
title = somestring(underscore_indices(end)+12:fs_indices(end)-1);
somestring = COMPdata2010res7;
underscore_indices = strfind(somestring,'creator":"');
fs_indices = strfind(somestring,'","prism:publication');
creator = somestring(underscore_indices(end)+10:fs_indices(end)-1);
[DATA Example from ONE String] [Data filename: "COMPdata2010res7.mat"]
{"search-results":{"opensearch:totalResults":"3127","opensearch:startIndex":"1201","opensearch:itemsPerPage":"200","opensearch:Query":{"@role": "request", "@searchTerms": "%28%28ALL%28battery%29+AND+NOT+KEY%28primary%29+AND+%28+TITLE-ABS-KEY%28%22battery+system%22%29+OR+TITLE-ABS-KEY%28%22battery+module%22%29+OR+TITLE-ABS-KEY%28%22battery+pack%22%29+OR+TITLE-ABS-KEY%28%22secondary+battery%22%29%29+%29+AND+PUBYEAR+%3E+2009+AND+PUBYEAR+%3C+2011%29+AND+%28SUBJAREA%28COMP%29+%29", "@startPage": "1201"},"link": [{"@_fa": "true", "@ref": "self", "@href": "http://api.elsevier.com:80/content/search/scopus?start=1201&count=200&query=%28%28ALL%28battery%29+AND+NOT+KEY%28primary%29+AND+%28+TITLE-ABS-KEY%28%22battery+system%22%29+OR+TITLE-ABS-KEY%28%22battery+module%22%29+OR+TITLE-ABS-KEY%28%22battery+pack%22%29+OR+TITLE-ABS-KEY%28%22secondary+battery%22%29%29+%29+AND+PUBYEAR+%3E+2009+AND+PUBYEAR+%3C+2011%29+AND+%28SUBJAREA%28COMP%29+%29&apiKey=6492f9c867ddf3e84baa10b5971e3e3d", "@type": "application/json"},{"@_fa": "true", "@ref": "first", "@href": "http://api.elsevier.com:80/content/search/scopus?start=0&count=200&query=%28%28ALL%28battery%29+AND+NOT+KEY%28primary%29+AND+%28+TITLE-ABS-KEY%28%22battery+system%22%29+OR+TITLE-ABS-KEY%28%22battery+module%22%29+OR+TITLE-ABS-KEY%28%22battery+pack%22%29+OR+TITLE-ABS-KEY%28%22secondary+battery%22%29%29+%29+AND+PUBYEAR+%3E+2009+AND+PUBYEAR+%3C+2011%29+AND+%28SUBJAREA%28COMP%29+%29&apiKey=6492f9c867ddf3e84baa10b5971e3e3d", "@type": "application/json"},{"@_fa": "true", "@ref": "prev", "@href": "http://api.elsevier.com:80/content/search/scopus?start=1001&count=200&query=%28%28ALL%28battery%29+AND+NOT+KEY%28primary%29+AND+%28+TITLE-ABS-KEY%28%22battery+system%22%29+OR+TITLE-ABS-KEY%28%22battery+module%22%29+OR+TITLE-ABS-KEY%28%22battery+pack%22%29+OR+TITLE-ABS-KEY%28%22secondary+battery%22%29%29+%29+AND+PUBYEAR+%3E+2009+AND+PUBYEAR+%3C+2011%29+AND+%28SUBJAREA%28COMP%29+%29&apiKey=6492f9c867ddf3e84baa10b5971e3e3d", "@type": 
"application/json"},{"@_fa": "true", "@ref": "next", "@href": "http://api.elsevier.com:80/content/search/scopus?start=1401&count=200&query=%28%28ALL%28battery%29+AND+NOT+KEY%28primary%29+AND+%28+TITLE-ABS-KEY%28%22battery+system%22%29+OR+TITLE-ABS-KEY%28%22battery+module%22%29+OR+TITLE-ABS-KEY%28%22battery+pack%22%29+OR+TITLE-ABS-KEY%28%22secondary+battery%22%29%29+%29+AND+PUBYEAR+%3E+2009+AND+PUBYEAR+%3C+2011%29+AND+%28SUBJAREA%28COMP%29+%29&apiKey=6492f9c867ddf3e84baa10b5971e3e3d", "@type": "application/json"},{"@_fa": "true", "@ref": "last", "@href": "http://api.elsevier.com:80/content/search/scopus?start=2927&count=200&query=%28%28ALL%28battery%29+AND+NOT+KEY%28primary%29+AND+%28+TITLE-ABS-KEY%28%22battery+system%22%29+OR+TITLE-ABS-KEY%28%22battery+module%22%29+OR+TITLE-ABS-KEY%28%22battery+pack%22%29+OR+TITLE-ABS-KEY%28%22secondary+battery%22%29%29+%29+AND+PUBYEAR+%3E+2009+AND+PUBYEAR+%3C+2011%29+AND+%28SUBJAREA%28COMP%29+%29&apiKey=6492f9c867ddf3e84baa10b5971e3e3d", "@type": "application/json"}],"entry": [{"@_fa": "true", "link": [{"@_fa": "true", "@ref": "self", "@href": "http://api.elsevier.com/content/abstract/scopus_id/78049372368"},{"@_fa": "true", "@ref": "author-affiliation", "@href": "http://api.elsevier.com/content/abstract/scopus_id/78049372368?field=author,affiliation"},{"@_fa": "true", "@ref": "scopus", "@href": "http://www.scopus.com/inward/record.url?partnerID=HzOxMe3b&scp=78049372368&origin=inward"},{"@_fa": "true", "@ref": "scopus-citedby", "@href": "http://www.scopus.com/inward/citedby.url?partnerID=HzOxMe3b&scp=78049372368&origin=inward"},{"@_fa": "true", "@ref": "full-text", "@href": "http://api.elsevier.com/content/article/eid/1-s2.0-S0360835210002287"}],"prism:url":"http://api.elsevier.com/content/abstract/scopus_id/78049372368","dc:identifier":"SCOPUS_ID:78049372368","eid":"2-s2.0-78049372368","dc:title":"Developing Oregon's renewable energy portfolio using fuzzy goal programming model","dc:creator":"Daim 
T.","prism:publicationName":"Computers and Industrial Engineering","prism:issn":"03608352","prism:volume":"59","prism:issueIdentifier":"4","prism:pageRange":"786-793","prism:coverDate":"2010-11-01","prism:coverDisplayDate":"November 2010","prism:doi":"10.1016/j.cie.2010.08.004","pii":"S0360835210002287","citedby-count":"16","affiliation": [{"@_fa": "true", "affilname":"Portland State University","affiliation-city":"Portland","affiliation-country":"United States"}],"prism:aggregationType":"Journal","subtype":"ar","subtypeDescription":"Article","source-id":"18164"},{"@_fa": "true", "link": [{"@_fa": "true", "@ref": "self", "@href": "http://api.elsevier.com/content/abstract/scopus_id/79953798579"},{"@_fa": "true", "@ref": "author-affiliation", "@href": "http://api.elsevier.com/content/abstract/scopus_id/79953798579?field=author,affiliation"},{"@_fa": "true", "@ref": "scopus", "@href": "http://www.scopus.com/inward/record.url?partnerID=HzOxMe3b&scp=79953798579&origin=inward"},{"@_fa": "true", "@ref": "scopus-citedby", "@href": "http://www.scopus.com/inward/citedby.url?partnerID=HzOxMe3b&scp=79953798579&origin=inward"}],"prism:url":"http://api.elsevier.com/content/abstract/scopus_id/79953798579","dc:identifier":"SCOPUS_ID:79953798579","eid":"2-s2.0-79953798579","dc:title":"Discovery and analysis of tightly knit communities in telecom social networks","dc:creator":"Modani N.","prism:publicationName":"IBM Journal of Research and Development","prism:issn":"00188646","prism:eIssn":"00188646","prism:volume":"54","prism:issueIdentifier":"6","prism:coverDate":"2010-11-01","prism:coverDisplayDate":"November 2010","prism:doi":"10.1147/JRD.2010.2081230","citedby-count":"2","affiliation": [{"@_fa": "true", "affilname":"IBM India Research Laboratory New Delhi","affiliation-city":"New Delhi","affiliation-country":"India"}],"prism:aggregationType":"Journal","subtype":"ar","subtypeDescription":"Article","article-number":"5643246","source-id":"15099"},{"@_fa": "true", "link": [{"@_fa": 
"true", "@ref": "self", "@href": "http://api.elsevier.com/content/abstract/scopus_id/77958558552"},{"@_fa": "true", "@ref": "author-affiliation", "@href": "http://api.elsevier.com/content/abstract/scopus_id/77958558552?field=author,affiliation"},{"@_fa": "true", "@ref": "scopus", "@href": "http://www.scopus.com/inward/record.url?partnerID=HzOxMe3b&scp=77958558552&origin=inward"},{"@_fa": "true", "@ref": "scopus-citedby", "@href": "http://www.scopus.com/inward/citedby.url?partnerID=HzOxMe3b&scp=77958558552&origin=inward"}],"prism:url":"http://api.elsevier.com/content/abstract/scopus_id/77958558552","dc:identifier":"SCOPUS_ID:77958558552","eid":"2-s2.0-77958558552","dc:title":"The development of the display terminal system used in PHEV based on CAN bus"
1 Comment
Stephen23
2015-7-14
It is easier for everyone if you simply upload your sample data, rather than pasting it into your question. You can edit your question, delete that huge block of text, and then upload the text using the paperclip button and then pressing both Choose file and Attach file.
Answers (1)
Abhishek Pandey
2015-7-16
Hello Marcus,
I understand that you’re trying to extract information like identifier, title, creator, and so on from a search string, and organize it into a cell/matrix.
Although it would be easier for the community to help if you attached sample data to your question, I believe you can do this with the “strsplit” and “strfind” functions.
The “strsplit” function takes a string and a delimiter as input arguments and returns a cell array containing the pieces of the string that lie between the delimiters. The delimiter can be a multi-character string. The “strfind” function, by contrast, searches the string for occurrences of a pattern and returns a vector of the indices at which each occurrence starts.
For example, for the following lines of code,
str = 'abcdabcdcdefabcd';
A = strsplit(str, 'ab')
The output is:
A =
'' 'cd' 'cdcdef' 'cd'
On the other hand, for the following lines of code,
str = 'abcdabcdcdefabcd';
B = strfind(str, 'ab')
The output is:
B =
1 5 13
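Applied to data like that in the question, the indices returned by strfind can be used to cut out the value of a single field. The following is only a minimal sketch: the string s is a shortened, hypothetical stand-in for one of the stored strings, and the code assumes that a value never contains a double quote of its own.

```matlab
% Shortened stand-in for one crawled string (assumption: the real strings
% follow the same "key":"value" pattern)
s = '{"dc:identifier":"SCOPUS_ID:78049372368","dc:title":"Some title","dc:creator":"Daim T."}';

key    = '"dc:title":"';                 % text that directly precedes the value
starts = strfind(s, key) + length(key);  % first character of each value

titles = cell(1, numel(starts));
for k = 1:numel(starts)
    rest = s(starts(k):end);
    stop = strfind(rest, '"');           % first quote after the value starts closes it
    titles{k} = rest(1:stop(1)-1);
end
```

Searching for the closing quote from the start of each value avoids hard-coding which field comes next, so the same loop works for any of the keys listed in the question.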
Since the information that you are seeking seems to be in a specific order, you could store the separated strings for different delimiters in different vectors and associate them accordingly using their indices.
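One possible way to do this association is sketched below. It uses strfind to locate the start of each record and regexp with the 'tokens' option to pull out the values, which is a swapped-in alternative to strsplit. The string s, the shortened field list, and the variable names are hypothetical, and the sketch assumes that every record contains dc:identifier before its other fields and that values contain no double quotes.

```matlab
% Two shortened, hypothetical records in the same layout as the posted data
s = ['{"entry": [{"dc:identifier":"SCOPUS_ID:1","dc:title":"First title",' ...
     '"dc:creator":"Daim T.","citedby-count":"16"},' ...
     '{"dc:identifier":"SCOPUS_ID:2","dc:title":"Second title","citedby-count":"2"}]}'];

fields = {'dc:identifier', 'dc:title', 'dc:creator', 'citedby-count'};

% Each record starts at its dc:identifier key and runs up to the next one
first = strfind(s, '"dc:identifier":"');
last  = [first(2:end) - 1, length(s)];

result = [fields; cell(numel(first), numel(fields))];  % header row + one row per record
for r = 1:numel(first)
    chunk = s(first(r):last(r));
    for c = 1:numel(fields)
        % Capture everything between the quotes that follow the key
        pattern = ['"' regexptranslate('escape', fields{c}) '":"([^"]*)"'];
        tok = regexp(chunk, pattern, 'tokens', 'once');
        if isempty(tok)
            result{r+1, c} = '';       % field is missing in this record
        else
            result{r+1, c} = tok{1};
        end
    end
end
```

Here result holds the field names in its first row and one record per following row; result.' gives the transposed layout with one record per column. Working record by record keeps the fields aligned even when a record lacks one of them, as the third entry in the posted data does. If the stored strings are complete JSON documents and a release that provides jsondecode is available, parsing them directly may be more robust than pattern matching.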
I hope that helps!
- Abhishek