Performance on new data will almost always be somewhat worse than on
training data, regardless of whether your model is overfitted or not.
The ML 101 method of identifying overfitting is to plot your metric
value by epoch (or by # of trees in your case) on an evaluation or
validation dataset and look at the average slope of the curve towards
the end. Assuming a metric where higher is better (e.g. accuracy): if
the curve is still going up, you're underfitting; if it's going down,
you're overfitting; and if it's flat, you've found the sweet spot.
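A rough sketch of that plot-the-curve idea, assuming a gradient-boosted
ensemble so there is a "# of trees" axis; the synthetic dataset and the
50-point tail window are just placeholders for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Placeholder data; substitute your own train/validation split.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# staged_predict yields predictions after each added tree, so we can
# trace the validation metric as the ensemble grows.
val_acc = [accuracy_score(y_val, pred)
           for pred in model.staged_predict(X_val)]

# Average slope near the end of the curve: positive -> still improving
# (underfitting), negative -> getting worse (overfitting), ~0 -> flat.
tail = np.array(val_acc[-50:])
slope = np.polyfit(np.arange(len(tail)), tail, 1)[0]
print(f"best # of trees ~ {int(np.argmax(val_acc)) + 1}, "
      f"tail slope = {slope:.2e}")
```

You'd normally feed val_acc to matplotlib and eyeball the curve rather
than fit a slope, but the number makes the "average slope towards the
end" wording concrete.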
As for feature importance, many decision tree libraries can provide
this info. I know xgboost does. I've never worked with sklearn, but a
cursory search turned up this page:
https://stackoverflow.com/questions/49170296/scikit-learn-feature-importance-calculation-in-decision-trees
So it appears to be available in sklearn as well.
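As that page discusses, fitted sklearn trees expose a
feature_importances_ attribute (impurity-based scores, one per
feature, summing to 1). A minimal sketch on a built-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Built-in dataset just for illustration; use your own features.
data = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# One impurity-based importance score per feature; higher means the
# feature contributed more to reducing impurity across the tree's splits.
for name, imp in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

Note these are impurity-based importances computed on the training
data, which is a different (and cheaper) notion than SHAP values.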
HTH
--Tony
On 1/2/2022 3:45 PM, pranav@xxxxxxxxxxxxxxxxx wrote:
Hi all,
I have built a decision tree in python using the sklearn library. How do I
evaluate if it is overfitting or underfitting?
I also need to do some feature engineering in terms of determining
which features contribute most to my model. I am thinking about using
SHAP values, but when I print them I get numbers that I am not sure
how to interpret.
Pranav
** To leave the list, click on the immediately-following link:-
** [mailto:program-l-request@xxxxxxxxxxxxx?subject=unsubscribe]
** If this link doesn't work then send a message to:
** program-l-request@xxxxxxxxxxxxx
** and in the Subject line type
** unsubscribe
** For other list commands such as vacation mode, click on the
** immediately-following link:-
** [mailto:program-l-request@xxxxxxxxxxxxx?subject=faq]
** or send a message, to
** program-l-request@xxxxxxxxxxxxx with the Subject:- faq