How to identify data set characteristics which influence the success of a model using those data sets as input.

Question

Wayne Martin 2024-4-22


Commented: Wayne Martin 2024-5-3

I am studying the effect of hurricanes on coral reefs and have developed a damage prediction model which uses as inputs the fragility and distribution of different coral species at 150 post-storm survey sites. I can also create multiple simulated reefs by randomly assigning species, colonies and damage from the measured probability distribution functions of those parameters for each species. When I run 1000 simulated reef experiments, the results of my damage prediction are widely distributed, from terrible to great. I need to mine the 1000 simulated reefs to identify patterns which are influencing the success of the model. I expect this is a common scenario and would appreciate any guidance on which tools to use and how to proceed. I have the Statistics and Machine Learning Toolbox.


Answer 1

Yatharth 2024-5-3

Hello Wayne,

To identify the data characteristics that influence the success of your model, you can start with some basic Exploratory Data Analysis (EDA): understand the distributions of your parameters and outcomes, identify outliers, and see if there are any obvious patterns or correlations.

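Use the "histogram", "boxplot", or "scatter" functions to visualize the distributions of your parameters and outcomes.
Use "corrplot" to visualize correlations among parameters and between parameters and outcomes. Note that "corrplot" is part of Econometrics Toolbox; with only Statistics and Machine Learning Toolbox, "corr" returns the same correlation coefficients as numbers.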
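As a rough sketch of these EDA steps, the snippet below builds a stand-in data set with one row per simulated reef and looks at how each reef descriptor relates to model success. The variable names (meanFragility, speciesRichness, colonyDensity, predError) and the random data are placeholders I made up for illustration, so substitute your own reef summaries and success measure. It uses "corr" from Statistics and Machine Learning Toolbox rather than "corrplot", which ships with Econometrics Toolbox.

```matlab
% Hypothetical stand-in data: one row per simulated reef.
% All names and values are placeholders, not taken from the real study.
rng(1);                                      % reproducible random numbers
nReefs = 1000;
meanFragility   = rand(nReefs, 1);           % assumed reef-level descriptor
speciesRichness = randi([3 20], nReefs, 1);  % assumed reef-level descriptor
colonyDensity   = 5 + 2*randn(nReefs, 1);    % assumed reef-level descriptor
% Assumed measure of model success (here: a prediction error per reef)
predError = 0.8*meanFragility + 0.1*randn(nReefs, 1);

X = [meanFragility, speciesRichness, colonyDensity];
names = {'meanFragility', 'speciesRichness', 'colonyDensity'};

% Distribution of the outcome across the simulated reefs
figure; histogram(predError);
xlabel('Prediction error'); ylabel('Number of simulated reefs');

% Spread and outliers of each descriptor (standardized to share one axis)
figure; boxplot(zscore(X), 'Labels', names);

% One descriptor plotted against the outcome
figure; scatter(meanFragility, predError, 10, 'filled');
xlabel('meanFragility'); ylabel('Prediction error');

% Spearman rank correlation of each descriptor with the outcome
[rho, pval] = corr(X, predError, 'Type', 'Spearman');
disp(table(names', rho, pval, ...
    'VariableNames', {'Descriptor', 'Rho', 'PValue'}));
```

Spearman correlation is used here only because it does not assume a linear relationship; the default Pearson type would work the same way in the call.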

With many input parameters, it's crucial to identify which ones significantly impact the model's outcome. Feature selection techniques can help reduce dimensionality and focus on the most influential variables.

Use "sequentialfs" (sequential feature selection) to identify the most important features. This function can help you find a subset of the input variables that most effectively predict the outcome.
Consider using principal component analysis (PCA) with "pca" to reduce dimensionality and possibly uncover underlying patterns in your data.
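A minimal sketch of both suggestions follows, assuming your reef descriptors are the columns of a numeric matrix X and your model success score is a vector y. The data here are random placeholders, and the linear least-squares criterion passed to "sequentialfs" is just one possible choice, so treat it as a starting point rather than the recommended setup.

```matlab
% Hypothetical stand-in data: 1000 simulated reefs, 5 descriptors.
rng(2);                                   % reproducible random numbers
nReefs = 1000;
X = randn(nReefs, 5);                     % placeholder reef descriptors
% Placeholder success score that depends on descriptors 1 and 3 only
y = 2*X(:,1) - 1.5*X(:,3) + 0.5*randn(nReefs, 1);

% Criterion for sequentialfs: sum of squared errors on the held-out rows
% of a linear least-squares fit (with intercept) trained on the other rows.
critfun = @(Xtrain, ytrain, Xtest, ytest) ...
    sum((ytest - [ones(size(Xtest,1),1), Xtest] * ...
        ([ones(size(Xtrain,1),1), Xtrain] \ ytrain)).^2);

% Forward selection with 5-fold cross-validation
cvp = cvpartition(nReefs, 'KFold', 5);
[selected, history] = sequentialfs(critfun, X, y, 'cv', cvp);
disp(find(selected));                     % indices of the chosen descriptors

% PCA on standardized descriptors; 'explained' is the percentage of
% variance captured by each principal component.
[coeff, score, latent, ~, explained] = pca(zscore(X));
disp(explained');
```

Keep in mind that "pca" only looks at the descriptors, not at the success score, so it shows how the inputs vary together rather than which input drives success; "sequentialfs" is the step that ties descriptors to the outcome.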

Here are the links for some of the mentioned functions:

scatter: https://www.mathworks.com/help/matlab/ref/scatter.html
corrplot: https://www.mathworks.com/help/econ/corrplot.html
sequentialfs: https://www.mathworks.com/help/stats/sequentialfs.html
pca: https://in.mathworks.com/help/stats/pca.html

Here are some examples that might be useful in your case:

For feature selection: https://www.mathworks.com/help/stats/selecting-features-for-classifying-high-dimensional-data.html
For classification: https://www.mathworks.com/help/stats/classification-example.html
For cross validation: https://www.mathworks.com/help/stats/crossval.html


Wayne Martin 2024-5-3

Thank you very much! Wayne
