Reproducible Research Workflow in MATLAB
I'm intrigued by this article about Reproducible Research with R by Lincoln Mullen and would like to try the same idea out, but in MATLAB.
In the workflow for my current project, I am performing several long-running queries on a database and storing the results in .mat files locally (call this process A). Then I do some post-processing, which also takes a while, and then cache those results locally again (call this B). Finally, I produce some plots and a custom output file that will feed into some other analysis tools (call this C).
Currently, if I change process C, I reload the results from B and redo C; likewise, if I change B, I reload the results from A and redo B. This saves me time, but as my code becomes more complex, I suspect I'm at risk of changing some earlier code in A or B, forgetting to re-run it, and re-running only C.
If I could define the results files as targets in a way similar to GNU Make, then I'd be able to ensure that I don't forget to re-do any steps needed, without ever needlessly waiting for these long-running processes to complete. If someone else wants to reproduce my result, they should not have to learn how to run the several commands in sequence - they should just have to run make or something equivalent.
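To make the Make analogy concrete, here is a minimal sketch of a timestamp-based staleness check in MATLAB. The function name `isStale` and the file names in the usage comment are hypothetical, and comparing modification times is only the simplest rule Make uses; it does not detect changes in functions that a step calls indirectly.

```matlab
function tf = isStale(target, deps)
% isStale  True if TARGET is missing or older than any file in DEPS.
%   target : char path of a cached result file (for example a .mat file)
%   deps   : cell array of char paths (code files, upstream result files)
%
% Hypothetical usage for the A -> B -> C workflow described above:
%   if isStale('A_results.mat', {'runQueries.m'})
%       runQueries();      % assumed to write A_results.mat
%   end
%   if isStale('B_results.mat', {'postProcess.m', 'A_results.mat'})
%       postProcess();     % assumed to write B_results.mat
%   end
    t = dir(target);
    if isempty(t)
        tf = true;         % target does not exist yet: must build
        return
    end
    tf = false;
    for k = 1:numel(deps)
        d = dir(deps{k});
        if isempty(d)
            error('isStale:missingDep', 'Dependency not found: %s', deps{k});
        end
        if d.datenum > t.datenum
            tf = true;     % a dependency is newer than the target
            return
        end
    end
end
```

Because each check runs inside an already open MATLAB session, this avoids the start-up cost of launching MATLAB once per Make rule.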
Driving MATLAB with Make itself doesn't seem like a solution because of MATLAB's long start-up time. I'm also on a Windows machine for this work, so it would be inconvenient to set that up. So I believe I should be running something in MATLAB to do this. Does anybody know of a MATLAB library or similar tool that can support that kind of workflow?
Answers (1)
Jonathan A
2019-4-11
I was looking for similar tools in MATLAB a few years ago. However, the ones I found each addressed only one part of the behavior you are describing. Therefore, I implemented a class that performs automatic persistent memoization by defining a directed acyclic graph (DAG) whose nodes are MATLAB functions. If (a) the node's main function and all the sub-functions called during its execution and (b) the node's input variables remain unchanged, then the results are retrieved from disk and not recomputed. This holds true even across different MATLAB processes, as it is based on the file system.
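As an illustration of the DAG part only, the following sketch declares a three-step pipeline with MATLAB's built-in `digraph` and runs the steps in dependency order. It is not the File Exchange class; the node names and the placeholder functions are assumptions, and no caching is done here.

```matlab
% Edges point from a producer to its consumer: A -> B -> C.
G = digraph({'A', 'B'}, {'B', 'C'});
assert(isdag(G), 'The pipeline graph must not contain cycles.');

% One function handle per node; each receives the results of its
% predecessors as arguments. These bodies are placeholders.
steps = containers.Map();
steps('A') = @() 1:5;            % stands in for the database queries
steps('B') = @(a) cumsum(a);     % stands in for the post-processing
steps('C') = @(b) sum(b);        % stands in for plots / output file

order = toposort(G);             % node indices, upstream nodes first
names = G.Nodes.Name(order);     % matching node names

results = containers.Map();
for k = 1:numel(names)
    name  = names{k};
    preds = predecessors(G, name);   % names of the upstream nodes
    args  = cellfun(@(p) results(p), preds, 'UniformOutput', false);
    f     = steps(name);
    results(name) = f(args{:});      % run the node on upstream results
end
```

A memoizing version would, before calling `f`, check whether the node's code and inputs are unchanged and load the stored result from disk instead.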
For (a), the most time-consuming part is the code analysis needed to find the sub-functions involved in the main function's execution. Therefore, I also persist this "dependency" information, based on the last-modified dates of the function files.
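A rough sketch of this idea, assuming `matlab.codetools.requiredFilesAndProducts` is an acceptable way to list the files a function depends on (the File Exchange class may do this differently). The function name `cachedDependencies` and the cache-file layout are hypothetical.

```matlab
function deps = cachedDependencies(mainFile, cacheFile)
% cachedDependencies  List the files MAINFILE depends on, reusing a
% stored list while none of the listed files has been modified.
%   mainFile  : char, full path (or name on the path) of the node function
%   cacheFile : char, path of a .mat file holding the stored list
    if exist(cacheFile, 'file') == 2
        S = load(cacheFile, 'mainFile', 'deps', 'stamps');
        if strcmp(S.mainFile, mainFile) && ...
                isequal(fileStamps(S.deps), S.stamps)
            deps = S.deps;       % all files unchanged: skip the analysis
            return
        end
    end
    % Slow path: static analysis of the code (this is the expensive step).
    deps   = matlab.codetools.requiredFilesAndProducts(mainFile);
    stamps = fileStamps(deps);
    save(cacheFile, 'mainFile', 'deps', 'stamps');
end

function stamps = fileStamps(files)
% Last-modified time of each file as a serial date number.
    stamps = zeros(1, numel(files));
    for k = 1:numel(files)
        d = dir(files{k});
        if ~isempty(d)
            stamps(k) = d.datenum;
        end   % a missing file keeps stamp 0, which forces re-analysis
    end
end
```

One limitation of this sketch: a change to the MATLAB path that shadows one of the listed functions would not be noticed, because only modification dates are compared.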
For (b), either the variable content is hashed or the modification date of the variable's MAT-file is used to check that the variables did not change.
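For the hashing option, a minimal sketch is shown below. It assumes the input is a real, full numeric array and that MATLAB is running with Java enabled, since it uses `java.security.MessageDigest`. The function name `hashArray` is hypothetical, and structs, cells, and other types would need their own serialization.

```matlab
function h = hashArray(x)
% hashArray  SHA-256 hex digest of a real, full numeric array.
% The class and size are hashed too, so [1 2 3] and [1; 2; 3] differ.
    assert(isnumeric(x) && isreal(x) && ~issparse(x), ...
        'hashArray:unsupported', 'Only real, full numeric arrays.');
    md = java.security.MessageDigest.getInstance('SHA-256');
    md.update(uint8(class(x)));                      % type name
    md.update(typecast(uint64(size(x)), 'uint8'));   % shape
    if ~isempty(x)
        md.update(typecast(x(:).', 'uint8'));        % raw data bytes
    end
    raw = typecast(md.digest(), 'uint8');            % Java returns int8
    h = lower(reshape(dec2hex(raw, 2).', 1, []));    % 64 hex characters
end
```

Storing this digest next to a cached result lets a later run compare it with the digest of the current input and decide whether the result on disk can be reused. Using the MAT-file date instead is cheaper for large variables, but it treats a re-saved file with identical content as changed.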
Moreover, I wanted to declare the DAG purely in MATLAB rather than driving it with Make, as you wrote, and I also wanted to visualize the execution of the DAG. Therefore, I implemented the following class: https://de.mathworks.com/matlabcentral/fileexchange/71180-explore
Not sure if it is the kind of tool you are looking for, but if in the meantime you have found other solutions or other relevant discussions, I would be curious to hear about them :-) Thanks!