How to extract data from PDF that contains a plot and a table

Question

Camila Coria 2018-12-27

Direct link to this question: https://ww2.mathworks.cn/matlabcentral/answers/437432-how-to-extract-data-from-pdf-that-contains-a-plot-and-a-table

Answered: adam 2023-11-29

35517.001.pdf

Hi,

I have thousands of PDFs with a format similar to the one attached. Each starts with a plot, followed by a summary table of values.

I need to extract the numbers in the table and store them in a matrix to be processed later (the first column can be excluded because it contains the word "average"). I will also need to loop through the PDFs and extract the numbers from each one. The PDF names are in sequential numeric order.

I have tried several things, but none of them worked. I would appreciate any help. Thanks in advance.

2 Comments

sherry james 2018-12-28

Hi,

Since you want to extract the numbers in the table, you should use an application that can extract data from PDFs. One such utility is SysTools PDF Toolbox Software. With it, you can extract numbers from multiple PDF documents; the data for each PDF is saved in a separate .txt file.

Visit the link & extract numbers: https://www.systoolsgroup.com/pdf-toolbox.html

Camila Coria 2019-1-3

Thanks, but I was hoping not to need additional software, to keep things easier for future users.


Answer 1

Surbhi Pillai 2018-12-31

Direct link to this answer: https://ww2.mathworks.cn/matlabcentral/answers/437432-how-to-extract-data-from-pdf-that-contains-a-plot-and-a-table#answer_354473

Hi Camila

The MATLAB Answers thread linked below explains how to extract data from a PDF file in MATLAB.

https://in.mathworks.com/matlabcentral/answers/155500-how-to-extract-data-from-pdf-file-in-matlab

I hope this helps....


Answer 2

DGM 2023-11-29

Direct link to this answer: https://ww2.mathworks.cn/matlabcentral/answers/437432-how-to-extract-data-from-pdf-that-contains-a-plot-and-a-table#answer_1362027

35517.001.pdf

fname = '35517.001.pdf';
% extractFileText requires Text Analytics Toolbox
str = extractFileText(fname);
% get the main table
T = extractBetween(str,'Dmax','AVERAGE');
T = strcat('Dmax',T);
T = split(T,newline);
% get rid of empty lines, reshape
mk = ~cellfun('isempty',T);
T = T(mk);
T = reshape(T,[],7);
% get the last row (the column averages)
lastrow = extractAfter(str,'AVERAGE');
lastrow = split(lastrow,newline);
% get rid of empty lines, reshape
mk = ~cellfun('isempty',lastrow);
lastrow = lastrow(mk);
lastrow = reshape(lastrow,[],7);
% convert to numeric data
data = str2double(T(2:end,:))
% Output:
% data = 3×7
%   1.0e+03 *
%     0.0400    0.0636    0.0016    0.0107    2.2365    0.0012    0.1465
%     0.0401    0.0588    0.0015    0.0092    1.8422    0.0012    0.1539
%     0.0400    0.0571    0.0014    0.0085    1.7045    0.0012    0.1543

averages = str2double(lastrow)
% Output:
% averages = 1×7
%   1.0e+03 *
%     0.0400    0.0599    0.0015    0.0095    1.9278    0.0012    0.1516

% extract column headers if you want to make a table or something
headers = T(1,:)
% Output:
% headers = 1×7 string array
%     "Dmax(cm)"    "Fmax(t)"    "Keff(t/cm)"    "Qd(t)"    "EDC(t.cm)"    "K2fit(t/cm)"    "V(cm/min)"

Generally, it's not that simple. See also:

https://www.mathworks.com/matlabcentral/answers/155500-how-to-extract-data-from-pdf-file-in-matlab#answer_152421

https://www.mathworks.com/matlabcentral/fileexchange/19798-extract-text-from-a-pdf-document
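
Since the question also asks about looping over many files, here is a rough sketch of how the single-file extraction above might be wrapped in a loop. It is not part of the original answer. The folder name pdfs and the helper parseSummaryTable are hypothetical, and the sketch assumes every PDF has the same 7-column table bracketed by "Dmax" and "AVERAGE", with text coming out column by column as it does for 35517.001.pdf. It also assumes Text Analytics Toolbox is available for extractFileText.

```matlab
% Sketch only: batch version of the single-file extraction shown above.
% Assumptions (not from the original post): the PDFs live in the
% hypothetical folder 'pdfs' and all share the same 7-column table.
pdfDir = 'pdfs';
files = dir(fullfile(pdfDir, '*.pdf'));
allData = cell(numel(files), 1);

for k = 1:numel(files)
    str = extractFileText(fullfile(files(k).folder, files(k).name));
    allData{k} = parseSummaryTable(str, 7);
end

% Stack the per-file rows into one numeric matrix (7 columns)
dataMatrix = vertcat(allData{:});

function data = parseSummaryTable(str, ncols)
% Hypothetical helper: numeric rows between the headers and the AVERAGE row.
    body = extractBetween(string(str), "Dmax", "AVERAGE");
    if isempty(body)
        error('parseSummaryTable:noTable', 'Table markers not found.');
    end
    cells = strtrim(split("Dmax" + body(1), newline));
    cells = cells(strlength(cells) > 0);          % drop empty lines
    if mod(numel(cells), ncols) ~= 0
        error('parseSummaryTable:badCount', ...
            'Expected a multiple of %d cells, found %d.', ncols, numel(cells));
    end
    cells = reshape(cells, [], ncols);            % text arrives column by column
    data = str2double(cells(2:end, :));           % skip the header row
end
```

The helper raises an error instead of silently reshaping when the cell count does not divide evenly by the column count, so a file with a different layout stops the loop rather than contaminating the matrix. Listing files with dir avoids having to guess the exact numbering scheme of the file names.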

You can't generally rely on tabular data being presented in the expected order. In my experience, it's not even guaranteed that a set of similar-looking related files has a consistent ordering. I don't know how extractFileText() works, but I bet that you will have to check the integrity of all your extracted data if you want to be sure it's not mixed up nonsense.
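
One way such an integrity check might look is sketched below. It is only an illustration: checkExtraction is a hypothetical helper, and the check assumes the AVERAGE row is the arithmetic mean of the rows above it (the displayed values are consistent with that, but the thread does not confirm it). The numbers are the rounded values from the output above, and the 1% tolerance is an arbitrary choice to absorb rounding.

```matlab
% Sketch only: sanity checks on one file's extracted values.
% Assumption (not confirmed by the thread): the AVERAGE row is the
% arithmetic mean of the rows above it, up to rounding in the report.
expected = ["Dmax(cm)" "Fmax(t)" "Keff(t/cm)" "Qd(t)" ...
            "EDC(t.cm)" "K2fit(t/cm)" "V(cm/min)"];

% Values as displayed in the answer's output (rounded for display)
data = [40.0 63.6 1.6 10.7 2236.5 1.2 146.5
        40.1 58.8 1.5  9.2 1842.2 1.2 153.9
        40.0 57.1 1.4  8.5 1704.5 1.2 154.3];
averages = [40.0 59.9 1.5 9.5 1927.8 1.2 151.6];

okGood = checkExtraction(expected, data, averages, expected, 0.01);

% Simulate a mixed-up extraction by swapping two data columns
shuffled = data(:, [1 5 3 4 2 6 7]);
okBad = checkExtraction(expected, shuffled, averages, expected, 0.01);

function ok = checkExtraction(headers, data, averages, expectedHeaders, relTol)
% Hypothetical helper: true only if headers, sizes and averages all agree.
    averages = averages(:).';
    ok = isequal(string(headers(:)), string(expectedHeaders(:))) ...
        && size(data, 2) == numel(averages) ...
        && ~isempty(data) ...
        && ~any(isnan(data(:))) && ~any(isnan(averages)) ...
        && all(abs(mean(data, 1) - averages) <= relTol * abs(averages));
end
```

Here okGood should come out true and okBad false, because the swapped columns no longer match the reported averages. A check like this catches reordered columns and non-numeric cells, but it cannot detect rows that were shuffled within a column, since the column mean is unchanged by that.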