How to read a PDF file in MATLAB?
I want to read a PDF file, make some changes to it, and then save the results to Excel. I have tried my best but fail every time. I need your help; any effort will be greatly appreciated. Thanks in advance.
20 Comments
Geoff Hayes
2014-8-16
What kind of changes do you want to make to the PDF that you wish to then save to Excel? What is the code that you have written so far?
azizullah khan
2014-8-16
I want to capture some data, and I haven't written any code so far. My first step is to read the PDF file. Thanks for the comments.
azizullah khan
2014-8-25
Geoff Hayes, thanks for the comments. Please just give me a clue as to how it is possible to read PDF files. I am waiting for your response.
Geoff Hayes
2014-8-25
azizullah - I noticed that you looked at Dimitri Shvorob's extract text from PDF on the MATLAB File Exchange, but you had some problems with it. Did you download the two libraries that are needed for this submission, and modify the pdfParseDemo.m file as per the author's instructions?
One of the comments on the above submission indicates that there is a utility called pdftotext that you may be able to call from within the MATLAB code. Have you looked into this?
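As a rough illustration of calling pdftotext from MATLAB, here is a minimal sketch. It assumes pdftotext is installed and on the system path, and the file names are hypothetical placeholders, not anything from the original question.

```matlab
% Sketch: convert one PDF to text with the external pdftotext tool.
% Assumes pdftotext is installed and on the system path.
pdfFile = 'example.pdf';   % hypothetical input file
txtFile = 'example.txt';   % hypothetical output file

% -layout asks pdftotext to keep the original physical layout where it can.
cmd = sprintf('pdftotext -layout "%s" "%s"', pdfFile, txtFile);
[status, msg] = system(cmd);

if status == 0
    txt = fileread(txtFile);   % whole file as one character vector
    fprintf('Read %d characters from %s\n', numel(txt), txtFile);
else
    warning('pdftotext failed (status %d): %s', status, msg);
end
```

Once the text is in `txt`, ordinary string functions such as `regexp` or `strsplit` can be used to pick out the data of interest.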
José-Luis
2014-8-25
What is your goal with this? It might be that Matlab is not the best tool for this.
azizullah khan
2014-8-25
Yes, I have done what was required, but pdfParsedemo makes a problem with me...
azizullah khan
2014-8-25
Thanks, José-Luis. My goal is to capture data from a PDF file and save the captured data to Excel.
Geoff Hayes
2014-8-25
Is there just one PDF file, or several? What data in particular are you looking for in the pdf - a table of numeric data, some text, or ..?
José-Luis
2014-8-25
Why go through Matlab at all? Use Excel directly. A quick Google search will tell you how to import PDFs into Excel.
azizullah khan
2014-8-25
I have thousands of PDF files to get data from, and doing it manually is very difficult. That is why I am using MATLAB. Thanks.
Geoff Hayes
2014-8-25
Have you considered using pdftotext? Or any other converter, to HTML for example? Supposing that you are able to convert the file to text, what would you be looking in it for? Is there just one page of data that you need or one line from each page or..?
You might want to provide an example of a PDF that you wish to extract data from, and indicate which data in the file you want.
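For a large number of files, the conversion and the search can be put in one loop. The following is only a sketch under stated assumptions: pdftotext is on the system path, the folder name `pdfs` and the workbook name `extracted.xlsx` are hypothetical, and the "Account No" pattern is a made-up example that would have to be adapted to the real documents.

```matlab
% Sketch: pull one field out of every PDF in a folder and save the results to Excel.
% The folder, output file, and search pattern below are all hypothetical.
pdfFolder = 'pdfs';
files = dir(fullfile(pdfFolder, '*.pdf'));

names    = cell(numel(files), 1);
accounts = cell(numel(files), 1);

for k = 1:numel(files)
    pdfFile = fullfile(pdfFolder, files(k).name);
    % A text-file argument of "-" makes pdftotext write to standard output,
    % which system() returns as its second output.
    [status, txt] = system(sprintf('pdftotext -layout "%s" -', pdfFile));
    names{k}    = files(k).name;
    accounts{k} = '';                      % stays empty if nothing is found
    if status == 0
        tok = regexp(txt, 'Account\s*No\.?\s*:?\s*(\d+)', 'tokens', 'once');
        if ~isempty(tok)
            accounts{k} = tok{1};
        end
    end
end

if ~isempty(files)
    T = table(names, accounts, 'VariableNames', {'File', 'Account'});
    writetable(T, 'extracted.xlsx');       % one row per PDF file
end
```

Files where the conversion fails or the pattern does not match keep an empty entry, so they are easy to spot in the spreadsheet afterwards.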
Jan
2014-8-26
@azizullah khan: You wrote "but pdfParsedemo makes a problem with me...". Please explain the problems. Your question is much too vague to be answered efficiently.
azizullah khan
2014-8-26
Edited: Walter Roberson
2015-5-25
The problem with pdfParsedemo: when I run the code, the following error appears:
??? Java exception occurred:
java.lang.NoClassDefFoundError: org/fontbox/afm/AFMParser
at org.pdfbox.pdmodel.font.PDFont.getAFM(PDFont.java:350)
at org.pdfbox.pdmodel.font.PDFont.getAverageFontWidthFromAFMFile(PDFont.java:313)
at org.pdfbox.pdmodel.font.PDSimpleFont.getAverageFontWidth(PDSimpleFont.java:231)
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:276)
Error in ==> Untitled at 20
pdfstr = reader.getText(pdfdoc) %#ok
java.lang.Throwable: Warning: You did not close the PDF Document
at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:418)
at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
at java.lang.ref.Finalizer.runFinalizer(Unknown Source)
at java.lang.ref.Finalizer.access$100(Unknown Source)
at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
java.lang.Throwable: Warning: You did not close the PDF Document
at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:418)
at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
at java.lang.ref.Finalizer.runFinalizer(Unknown Source)
at java.lang.ref.Finalizer.access$100(Unknown Source)
at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
java.lang.Throwable: Warning: You did not close the PDF Document
at org.pdfbox.cos.COSDocument.finalize(COSDocument.java:418)
at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method)
at java.lang.ref.Finalizer.runFinalizer(Unknown Source)
at java.lang.ref.Finalizer.access$100(Unknown Source)
at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
azizullah khan
2014-8-26
Geoff Hayes: I have attached the PDF file that I want to read in order to extract account info and some other data. Please explain whether this is possible. Thanks.
Geoff Hayes
2014-8-26
Azizullah - you did not include an attachment.
As for the error, the AFMParser is part of the FontBox library. Did you add the FontBox jar file path to your Java class path? I looked at the pdfParsedemo.m script, and while it doesn't have a command to do so, you probably should. So if you updated
javaaddpath('M:\My Documents\MATLAB\PDF Exercise\PDFBox-0.7.3\lib\PDFBox-0.7.3.jar')
to the path on your workstation that corresponds to PDFBox-0.7.3.jar (or whatever the jar file is), then you should add an equivalent statement for the FontBox
javaaddpath('whateverYourPathIsTo\FontBox-someVersionIds.jar')
(I don't know what the name of the jar is, so FontBox-someVersionIds.jar is just an example.)
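Put together, the setup might look roughly like the sketch below. Both jar paths are the same placeholders as above, `example.pdf` is a hypothetical file, and it is an assumption that `reader` in the demo script is an `org.pdfbox.util.PDFTextStripper` (the package name is taken from the stack trace). The document is closed explicitly, which should also avoid the "You did not close the PDF Document" warnings.

```matlab
% Sketch: add both jars to the dynamic Java class path, then extract the text.
% The three paths are placeholders; substitute the real locations and names.
pdfboxJar  = 'whateverYourPathIsTo\PDFBox-0.7.3.jar';
fontboxJar = 'whateverYourPathIsTo\FontBox-someVersionIds.jar';
pdfFile    = 'example.pdf';   % hypothetical input file

if exist(pdfboxJar, 'file') == 2 && exist(fontboxJar, 'file') == 2 ...
        && exist(pdfFile, 'file') == 2
    javaaddpath(pdfboxJar);
    javaaddpath(fontboxJar);   % provides org.fontbox.afm.AFMParser

    pdfdoc = org.pdfbox.pdmodel.PDDocument.load(pdfFile);
    try
        reader = org.pdfbox.util.PDFTextStripper();
        pdfstr = char(reader.getText(pdfdoc));   % java.lang.String to char
    catch err
        pdfdoc.close();        % close the document even when extraction fails
        rethrow(err);
    end
    pdfdoc.close();
    fprintf('Extracted %d characters\n', numel(pdfstr));
else
    disp('Set the three paths above to existing files first.');
end
```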
azizullah khan
2014-8-27
Yes, I did it as required. If there is any way to convert PDF into Excel in MATLAB, kindly share it with me. For example, can we load a PDF into another program with the help of MATLAB, convert the PDF into Excel there, and get the output? Is it possible for MATLAB to operate another program? Thanks.
Geoff Hayes
2014-8-27
Unfortunately, this is not something that I have considered and so am not aware of any other means of reading the pdf into MATLAB. You could always try the pdftotext program.
Naftali
2016-6-15
Edited: Naftali
2016-6-15
I am no expert, but I could not find a way to read a PDF file into MATLAB. People here talk about text, but a PDF is usually a series of pictures. I go to professional Adobe Reader and export the pages of the PDF document, either via File/Save As or via Advanced/Export. This produces a PNG or JPEG file for each page of the document. From there it is easy in MATLAB: loop over the pages with the imread function.
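The loop over exported pages could be sketched as follows, assuming the pages were exported as PNG files into a hypothetical folder named `pages`:

```matlab
% Sketch: read every exported page image in a folder.
pageFolder = 'pages';                               % hypothetical folder
pageFiles  = dir(fullfile(pageFolder, '*.png'));

pages = cell(numel(pageFiles), 1);                  % one image per cell
for k = 1:numel(pageFiles)
    pages{k} = imread(fullfile(pageFolder, pageFiles(k).name));
end
fprintf('Loaded %d page image(s)\n', numel(pages));
```

Note that `dir` typically returns names in alphabetical order, so zero-padded page numbers in the file names (page_001, page_002, ...) keep the pages in sequence.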
Walter Roberson
2016-6-15
pdf is effectively a programming language; you need to execute the commands in order to determine what the output is.
Stefanie Schwarz
2021-1-5
Following up with Naftali's comment, there is also a way to convert a PDF to an image file in MATLAB. See: https://www.mathworks.com/matlabcentral/answers/709623-how-can-i-convert-a-scanned-pdf-to-an-image-using-matlab
Accepted Answer
Christopher Creutzig
2017-10-16
Edited: Walter Roberson
2017-11-4
Just for the record, Text Analytics Toolbox (new in R2017b) includes a function extractFileText that will extract text data from PDF (or MS Word) files.
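A minimal sketch of how that might be used, assuming Text Analytics Toolbox is installed; the file names are hypothetical, and writing one text line per spreadsheet row is just one possible way to get the result into Excel:

```matlab
% Sketch: extract the text of a PDF and write it to Excel, one line per row.
% Requires Text Analytics Toolbox (R2017b or later).
pdfFile = 'example.pdf';                      % hypothetical input file

if exist(pdfFile, 'file') == 2
    txt = char(extractFileText(pdfFile));     % string scalar converted to char

    lines = strtrim(strsplit(txt, newline))'; % column cell array of lines
    lines = lines(~cellfun(@isempty, lines)); % drop blank lines

    T = table(lines, 'VariableNames', {'Line'});
    writetable(T, 'example_text.xlsx');       % hypothetical output workbook
else
    disp('Set pdfFile to an existing PDF first.');
end
```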