Extracting data from pdf files

Question

joseph Frank 2014-4-19

Direct link to this question

https://ww2.mathworks.cn/matlabcentral/answers/126386-extracting-data-from-pdf-files

Answered: Christopher Creutzig 2021-4-27

Hi,

I have around 300 PDF files with 19 pages each. I want to extract from each of them a fraction of a table on page 4 in order to build a research data set. Is it possible to do so using MATLAB? If so, which toolboxes and functions do I need? I have MATLAB R2013a.

Answer 1

Kristian Gennaci 2014-4-21

Direct link to this answer

https://ww2.mathworks.cn/matlabcentral/answers/126386-extracting-data-from-pdf-files#answer_134069

Hi Joseph,

Have you tried using this File Exchange submission?

http://www.mathworks.com/matlabcentral/fileexchange/19798-extract-text-from-a-pdf-document

This seems like the most promising solution. Alternatively, if you can convert the tables to an Excel spreadsheet or CSV file, they can then easily be parsed using MATLAB's Excel/CSV functions:

http://www.mathworks.com/help/matlab/spreadsheets.html

http://www.mathworks.com/help/matlab/ref/csvread.html
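
As a rough illustration of the CSV route, here is a minimal sketch. It assumes the table from page 4 has already been exported to a purely numeric CSV file with one header row; the file name, column names, and values are made up for the demo and are not from the original question.

```matlab
% Write a tiny demo file so the sketch runs on its own. In practice this
% file would come from whatever tool converted the PDF table to CSV.
csvFile = fullfile(tempdir, 'report_001_page4.csv');   % hypothetical name
fid = fopen(csvFile, 'w');
fprintf(fid, 'year,sales,costs\n');                    % made-up header
fprintf(fid, '2011,10.5,7.25\n');                      % made-up values
fprintf(fid, '2012,12.0,8.5\n');
fclose(fid);

% csvread takes zero-based row and column offsets, so (1, 0) starts
% reading at the second line and skips the header row.
data = csvread(csvFile, 1, 0);

% Keep only the fraction of the table that is needed, here columns 2 and 3.
subset = data(:, 2:3);
```

With 300 files, the same two calls could sit inside a loop over the output of dir('*.csv'), appending each subset to the data set.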

I'll let you know if I find any other solutions.

Best,

Kristian

Answer 2

Christopher Creutzig 2021-4-27

Direct link to this answer

https://ww2.mathworks.cn/matlabcentral/answers/126386-extracting-data-from-pdf-files#answer_685860

Just for the record: since R2017b, extractFileText('filename.pdf','Pages',4) from Text Analytics Toolbox gives you the text on ("physical") page 4 of the PDF, from which you can then extract the parts you need with string operations (extractBetween, regexp, etc.).
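
A minimal sketch of how that could look for a whole folder of files, assuming R2017b or later with Text Analytics Toolbox (so it will not run on R2013a). The folder name 'reports' and the marker strings "Table 2" and "Notes" are placeholders; the real markers depend on what the text of page 4 actually looks like.

```matlab
% Assumptions (not from the thread): the PDFs sit in a folder named
% 'reports', and the wanted part of the table on page 4 lies between the
% placeholder markers "Table 2" and "Notes".
startMarker = "Table 2";
endMarker   = "Notes";

% Helper: return every number found in a piece of text as a row vector.
parseNumbers = @(txt) str2double(regexp(char(txt), '-?\d+(\.\d+)?', 'match'));

files  = dir(fullfile('reports', '*.pdf'));
values = cell(numel(files), 1);

for k = 1:numel(files)
    pdfPath = fullfile(files(k).folder, files(k).name);

    % Text of physical page 4 only.
    pageText = extractFileText(pdfPath, 'Pages', 4);

    % Text between the two markers; skip the file if nothing was found.
    tableText = extractBetween(pageText, startMarker, endMarker);
    if isempty(tableText)
        warning('Markers not found in %s', files(k).name);
        continue
    end

    % Use the first match and convert the numbers it contains.
    values{k} = parseNumbers(tableText(1));
end
```

Each cell of values then holds the numbers found for one PDF. How well this works depends on how cleanly the table comes out as text, so it is worth inspecting pageText for a few files before settling on the markers and the regular expression.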