Plotting different imputation strategies
Note: The idea is inspired by a lecture by Andreas Muller.
Describe the solution you’d like
The idea is to visualize how closely a particular imputer imputes the given feature columns.
Is your feature request related to a problem? Please describe.
It gives a quick visual representation of how different imputation strategies work for the given feature columns of the data.
Examples
In the image below, I took the iris data and added nan to it across various rows. Then I constructed a function that plots how various imputation strategies impute the two given columns, col1 and col2 (for iris I used petal length and petal width). For iris I used the 3 different imputation strategies mentioned in the image.
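The data preparation described above is not shown in the issue. A minimal sketch of one way to do it follows; the helper name add_missing_values, the 20% missing rate and the random seed are assumptions made for illustration, not the author's actual setup.

```python
import numpy as np
from sklearn.datasets import load_iris


def add_missing_values(X, missing_rate=0.2, random_state=0):
    """Return a float copy of X with roughly `missing_rate` of its entries set to NaN.

    Hypothetical helper: the issue does not say how the NaNs were injected.
    """
    rng = np.random.default_rng(random_state)
    X_missing = X.astype(float, copy=True)
    # Draw one uniform number per cell and blank out the cells below the rate.
    mask = rng.random(X.shape) < missing_rate
    X_missing[mask] = np.nan
    return X_missing


X, y = load_iris(return_X_y=True)
X_nan = add_missing_values(X, missing_rate=0.2, random_state=0)

# Column indices of petal length and petal width in the iris feature matrix.
col1, col2 = 2, 3
```

The original X is left untouched, so the imputed values can later be compared against the true ones if desired.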
The code I used for this visualization is below (note that for now this code is just for demonstration purposes and can be improved):
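    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.exceptions import ConvergenceWarning
    # ignore_warnings lives in a private scikit-learn module; its location may change.
    from sklearn.utils._testing import ignore_warnings


    def get_full_and_nan_rows(X, col1, col2):
        """
        returns 2 lists,
        full_rows, which contains the indices of non-nan rows along given 2 columns.
        nan_rows, which contains the indices of nan rows along given 2 columns.
        """
        full_rows = []
        nan_rows = []
        for ind, row in enumerate(X):
            if any(np.isnan(row[[col1, col2]])):
                nan_rows.append(ind)
            else:
                full_rows.append(ind)
        return full_rows, nan_rows


    @ignore_warnings(category=ConvergenceWarning)
    def plot_2D_imputation(X, y, col1, col2, imputer, xlabel='', ylabel='', title='', figsize=(5, 5), alpha=0.6, s=80):
        full_rows, nan_rows = get_full_and_nan_rows(X, col1, col2)
        X_imp = imputer.fit_transform(X)
        # The original snippet used `ax` without creating it.
        fig, ax = plt.subplots(figsize=figsize)
        # Circles: rows that were complete in both columns (y must hold integer class labels).
        ax.scatter(X_imp[full_rows, col1], X_imp[full_rows, col2], c=plt.cm.tab10(
            y[full_rows]), alpha=alpha, s=s, marker='o')
        # Squares: rows where at least one of the two columns was imputed.
        ax.scatter(X_imp[nan_rows, col1], X_imp[nan_rows, col2], c=plt.cm.tab10(
            y[nan_rows]), alpha=alpha, s=s, marker='s')
        ax.set_xlabel(xlabel)
        ax.set_ylabel(ylabel)
        ax.set_title(title)
        return ax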
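To compare several strategies at a glance, the same idea can be laid out as one subplot per imputer. The sketch below is self-contained and makes several assumptions: the three imputers (mean, k-nearest neighbours, iterative) are only a guess at what the image showed, and the 20% missing rate is arbitrary. The row loop is replaced by a vectorized NaN mask, which selects the same rows.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
X_nan = X.copy()
X_nan[rng.random(X.shape) < 0.2] = np.nan  # assumed missing rate

col1, col2 = 2, 3  # petal length, petal width
# True for rows where at least one of the two plotted columns was missing.
imputed_mask = np.isnan(X_nan[:, [col1, col2]]).any(axis=1)

# Assumed set of strategies; swap in whichever imputers you want to compare.
imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "kNN": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(max_iter=10, random_state=0),
}

fig, axes = plt.subplots(1, len(imputers), figsize=(15, 5), sharex=True, sharey=True)
results = {}
for ax, (name, imputer) in zip(axes, imputers.items()):
    X_imp = imputer.fit_transform(X_nan)
    results[name] = X_imp
    # Circles are observed rows, squares are rows with an imputed coordinate.
    for mask, marker in ((~imputed_mask, "o"), (imputed_mask, "s")):
        ax.scatter(X_imp[mask, col1], X_imp[mask, col2],
                   c=plt.cm.tab10(y[mask]), alpha=0.6, s=80, marker=marker)
    ax.set_title(name)
    ax.set_xlabel("petal length (cm)")
axes[0].set_ylabel("petal width (cm)")
```

Sharing both axes across the subplots keeps the panels directly comparable, so differences in where the square markers land reflect the imputer and not the axis scaling.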
Issue Analytics
- Created 4 years ago
- Comments:11 (7 by maintainers)
Top GitHub Comments
Hello @greatsharma and thanks for checking out Yellowbrick! @bbengfort and I are both currently traveling, so it may take us a week or more to respond. We appreciate your patience and your feature suggestion!
Hi @greatsharma, sorry it’s taken me so long to respond; my GitHub emails got pretty buried. In principle, I’m fine with the approach that you mentioned. My only comment is to remove the plot_2d from the function name; so far we’ve chosen to pass 2d or 3d as a parameter to visualizers that do 2d or 3d visualization (see the PCA visualizer for an example). And if the 2d is removed, then plot becomes redundant. We would be interested in seeing some prototypes of this suggestion as a next step!
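As a rough illustration of the naming suggestion, the dimensionality could become an argument instead of part of the function name. In the sketch below the function name imputation, the parameter name projection and the features argument are all hypothetical; this is not Yellowbrick's API, only one shape such a prototype might take.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.impute import SimpleImputer


def imputation(X, y, imputer, features, projection=2, ax=None):
    """Scatter imputed data in 2D or 3D (hypothetical name and signature)."""
    if projection not in (2, 3):
        raise ValueError("projection must be 2 or 3")
    features = list(features)
    if len(features) != projection:
        raise ValueError("need one feature index per plotted dimension")

    # Rows where any plotted feature was missing before imputation.
    imputed = np.isnan(X[:, features]).any(axis=1)
    X_imp = imputer.fit_transform(X)

    if ax is None:
        fig = plt.figure(figsize=(5, 5))
        ax = fig.add_subplot(projection="3d" if projection == 3 else None)

    # Circles are observed rows, squares are rows with an imputed coordinate.
    for mask, marker in ((~imputed, "o"), (imputed, "s")):
        coords = [X_imp[mask, f] for f in features]
        ax.scatter(*coords, c=plt.cm.tab10(y[mask]), marker=marker, alpha=0.6)
    return ax


# Small synthetic example with integer class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = rng.integers(0, 3, size=60)
X[rng.random(X.shape) < 0.15] = np.nan

ax2d = imputation(X, y, SimpleImputer(), features=(0, 1))
ax3d = imputation(X, y, SimpleImputer(), features=(0, 1, 2), projection=3)
```

Accepting an optional ax lets the caller place the plot in an existing figure, which is convenient when comparing several imputers side by side.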