Extracting text from PDF with extractFileText is not working for some PDFs


mario 2023-10-17


Direct link to this question:

https://ww2.mathworks.cn/matlabcentral/answers/2034784-extracting-text-from-pdf-with-extractfiletext-is-not-working-for-some-pdf

Answered: Christopher Creutzig 2023-12-11

I am using the extractFileText function to extract text from PDF files, but for some files the function returns an empty string.

Using the pdfinfo function, I noticed that the PDF files from which extractFileText cannot extract text have a different Producer tag than the files for which it works. In particular, extractFileText seems to fail when the Producer tag is "iText 2.1.7 by 1T3XT".

No error message is generated; you simply get an empty string.

Can anyone help me? Thank you!
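
The comparison described above (Producer tag versus empty result) can be tabulated for a whole folder of files. This is only a minimal sketch, not a fix: the folder name "pdfs" and the variable names are hypothetical, and it uses only pdfinfo and extractFileText as they appear in this thread. It does not explain why extraction fails; it only makes it easier to see whether the empty results really line up with one Producer value.

```matlab
% Sketch: list the Producer tag of each PDF next to whether
% extractFileText returned any text for it.
pdfFolder = "pdfs";                              % hypothetical folder name
files = dir(fullfile(pdfFolder, "*.pdf"));
n = numel(files);

Filename = strings(n, 1);
Producer = strings(n, 1);
AllowsTextExtraction = false(n, 1);
IsEmpty = false(n, 1);

for k = 1:n
    f = fullfile(files(k).folder, files(k).name);
    info = pdfinfo(f);                           % metadata, including Producer
    txt = extractFileText(f);                    % returns "" for the problem files

    Filename(k) = files(k).name;
    Producer(k) = info.Producer;
    AllowsTextExtraction(k) = info.AllowsTextExtraction;
    IsEmpty(k) = strlength(strtrim(txt)) == 0;   % true when no text came back
end

report = table(Filename, Producer, AllowsTextExtraction, IsEmpty);
disp(report)

% Rows for which no text was extracted
failed = report(report.IsEmpty, :);
```

If every row in failed shares the same Producer value while AllowsTextExtraction is still true, that would support the observation above and is useful detail to include in a report to MathWorks support.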

7 comments

dpb 2023-10-18


cs 2023.01.03.pdf

pdfinfo('cs 2023.01.03.pdf')
ans = struct with fields:
                NumPages: 40
                PageSize: [40×4 double]
              PDFVersion: "1.4"
                   Title: ""
                 Subject: ""
                Language: ""
                Keywords: ""
                  Author: ""
                 Creator: ""
                Producer: "iText 2.1.7 by 1T3XT"
            CreationDate: 03-Jan-2023 03:17:20
        ModificationDate: 03-Jan-2023 03:17:20
               Encrypted: 0
    AllowsTextExtraction: 1
                Filename: "/users/mss.system.asxxnt/cs 2023.01.03.pdf"
extractFileText('cs 2023.01.03.pdf')
ans = ""

This probably confirms the same symptoms you get locally. I left a comment for a TMW staff member who responded to another question about reading PDF files, to make them aware of this issue if they come back, on the presumption that they might have a specific interest in MATLAB's PDF file functions.


mario 2023-10-19