Extract Text from Multiple PDF files across Multiple Folders

Version 1.0.0 (2.9 KB) by Samuel Veith
For when the text you need is in PDF documents saved across many folders
82 downloads
Updated 10 Sep 2019


Hello! I put this code together because my company had a backlog of individual quotes that we wanted to store in a single Excel file. I'm sharing it because we did not want to pay for a conventional PDF text reader.

This code was put together primarily from two sources. I recommend reading the first link, because you must download the toolbox from it for this code to work.
Extract text from single document*: https://www.mathworks.com/matlabcentral/fileexchange/19798-extract-text-from-a-pdf-document
Open files in multiple folders: https://www.mathworks.com/matlabcentral/answers/245959-how-to-read-text-files-from-different-sub-folders-in-a-folder

1. Download the code
2. Insert the file path for your PDFBox (line 10)
3. Pre-allocate a number of cells for an approximate number of PDFs you are trying to read (line 15)
4. Change your output text file name if you wish (line 99)
5. The default is for all PDF files in the chosen directory to be read. If you only wish to open files with a certain file pattern, adjust line 48.
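
Steps 3 and 5 could look roughly like the sketch below. This is not the submission's code: rootFolder, filePattern, pdfPaths and pdfText are illustrative names, and the sketch sizes the cell array from the actual file count rather than from an estimate. The '**' wildcard in dir needs R2016b or newer.

```matlab
% Sketch only: names and paths here are illustrative assumptions,
% not taken from the submission.
rootFolder  = pwd;       % top-level folder to search (assumption)
filePattern = '*.pdf';   % narrow this, e.g. 'Quote_*.pdf', to read fewer files

% '**' makes dir search rootFolder and every subfolder below it (R2016b+).
listing = dir(fullfile(rootFolder, '**', filePattern));
listing = listing(~[listing.isdir]);   % drop folders that happen to match

% Build one full path per file found.
pdfPaths = arrayfun(@(f) fullfile(f.folder, f.name), listing, ...
                    'UniformOutput', false);

% Pre-allocate one output cell per file.
pdfText = cell(numel(pdfPaths), 1);

fprintf('Found %d file(s) matching %s\n', numel(pdfPaths), filePattern);
```

Sizing pdfText from the listing means the pre-allocation never has to be guessed, but an approximate number works too if you build the file list differently.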

General heads-ups:
a. I'm not a prolific coder, so there's certainly some junk code in there!
b. Some users claim the PDF reader code doesn't work for them. It worked well for me the first time.
c. If you wish to write a separate text file for each PDF, move the file write into the loop above it
d. This does not process password-protected files
e. There will be PDF Java errors upon running this. You can ignore them.
f. I was not able to read all of my files. About 5% of them triggered the try/catch statement. Let me know if you figure out why!
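
For heads-ups (c) and (f), one possible pattern is to write each text file inside the loop and to record the error message for every file that lands in the catch branch, which may help narrow down why some PDFs fail. The sketch below is an assumption-laden illustration, not the submission's code: readPdfText is a placeholder for whatever PDF reader you use (fileread stands in only so the sketch runs), and outFolder, failed and the sample path are made-up names.

```matlab
% Sketch only: readPdfText is a placeholder, NOT the toolbox's real function.
% fileread is used purely so this runs; swap in your own PDF reader.
readPdfText = @(f) fileread(f);

% A deliberately missing file, to show the failure path (assumption).
pdfPaths  = {fullfile(tempdir, 'does_not_exist.pdf')};
outFolder = tempdir;            % where the per-file .txt outputs go (assumption)

pdfText = cell(numel(pdfPaths), 1);
failed  = cell(0, 2);           % one row per failure: {path, error message}

for k = 1:numel(pdfPaths)
    try
        pdfText{k} = readPdfText(pdfPaths{k});
    catch err
        % Keep the reason instead of silently skipping the file.
        failed(end+1, :) = {pdfPaths{k}, err.message}; %#ok<SAGROW>
        continue
    end

    % Heads-up (c): write one text file per PDF, inside the loop.
    [~, baseName] = fileparts(pdfPaths{k});
    fid = fopen(fullfile(outFolder, [baseName '.txt']), 'w');
    if fid == -1
        failed(end+1, :) = {pdfPaths{k}, 'could not open output file'}; %#ok<SAGROW>
        continue
    end
    fprintf(fid, '%s', pdfText{k});
    fclose(fid);
end

fprintf('%d of %d file(s) failed\n', size(failed, 1), numel(pdfPaths));
```

Note that PDFs with the same name in different folders would overwrite each other's output here; adding the folder name to the output file name is one way around that.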

*Note: there are newer versions of the toolbox than the one linked here. I cannot confirm whether they are compatible or work better, although users appear to have success with them.

Cite As

Samuel Veith (2024). Extract Text from Multiple PDF files across Multiple Folders (https://www.mathworks.com/matlabcentral/fileexchange/72706-extract-text-from-multiple-pdf-files-across-multiple-folders), MATLAB Central File Exchange. Retrieved .

MATLAB Release Compatibility
Created with R2019a
Compatible with any release
Platform Compatibility
Windows macOS Linux
Categories
Find more on Text Files in Help Center and MATLAB Answers
Version Published Release Notes
1.0.0