Converted version outputs index of class instead of class

If I train a classifier on non-consecutive class labels, the converted code (C in my case) outputs the index of the class instead of the class label itself. In my case the training data does not always contain an example for class 1; when it is missing, the classifier does not know this class exists. This creates discrepancies between Python and C.

import numpy as np
import m2cgen
from sklearn.ensemble import RandomForestClassifier
# linear mapping: x->x
# NB: my goal is not regression, this is just an example
x_train = np.repeat([0,1,2,3,4,5], 100).reshape([-1,1])
y_train = np.repeat([0,1,2,3,4,5], 100)

# however, class 1 is missing in training!
x_train = x_train[y_train!=1]
y_train = y_train[y_train!=1]

clf = RandomForestClassifier().fit(x_train, y_train)

# convert it
code = m2cgen.export_to_c(clf)

result = clf.predict(np.atleast_2d([0,1,2,3,4,5]).T)
# result =[0,0,2,3,4,5]

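Calling it in C gives different results:

/* Pseudocode for C: score() fills one probability per class known to the model */
double probs[5];          /* 5 classes: 0, 2, 3, 4, 5 */
score(features, probs);   /* features holds one input value */

/* argmax(probs) for the inputs 0..5 gives [0, 0, 1, 2, 3, 4] */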

Do you think there is any feasible way to keep the original class labels?

(see also https://github.com/nok/sklearn-porter/issues/37 having the same problem)
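
To make the mismatch concrete: scikit-learn keeps the sorted unique training labels in `clf.classes_`, and the generated `score` function fills one value per entry of that array. The following is a minimal pure-Python sketch of that mapping (no scikit-learn needed). The helper name `index_to_label` is hypothetical, and the argmax indices are simply the ones reported for the C code above.

```python
# Labels seen during training: class 1 was filtered out.
y_train = [label for label in [0, 1, 2, 3, 4, 5] for _ in range(100) if label != 1]

# Sorted unique labels, mirroring what scikit-learn keeps in clf.classes_.
classes = sorted(set(y_train))


def index_to_label(class_idx, classes):
    """Map an argmax index over the score output back to the original label."""
    return classes[class_idx]


# Argmax indices as reported for the C code in the issue.
c_indices = [0, 0, 1, 2, 3, 4]
labels = [index_to_label(i, classes) for i in c_indices]
print(classes)  # [0, 2, 3, 4, 5]
print(labels)   # [0, 0, 2, 3, 4, 5]
```

With the lookup applied, the C indices line up with the labels that `clf.predict` returns in Python.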

Issue Analytics

Created 5 years ago
Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction

skjerns commented, May 13, 2019

By the way, here is the final code wrapper that I came up with:

Wrapper to keep class labels

import logging as log

import m2cgen


def save_model_m2cgen(clf, file):
    """
    Converts and saves python to C code with m2cgen backend.
    I've hacked quite a few custom functions into the code.
    This code will probably only work with RandomForestClassifier.
    You might need to write a new main method for another classifier.
    """
    supported = ['LogisticRegression', 'RandomForestClassifier', 'SGDClassifier',
                 'RidgeClassifier', 'DecisionTreeClassifier', 'PassiveAggressiveClassifier']
    if clf.__class__.__name__ not in supported:
        log.warning('{} is not supported by m2cgen.'
                    ' Or maybe it is and needs to be added to the supported list.'.format(clf.__class__.__name__))
        return False
    code = m2cgen.export_to_c(clf)
    code = code.replace('#include <string.h>', '') # remove and add later
    
    # see which labels are in the classifier, so far only ints are supported
    labels = [str(int(i)) for i in clf.classes_]
    n_classes = len(labels)
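    # n_features_in_ is the attribute in current scikit-learn; older releases used n_features_
    n_feats = getattr(clf, 'n_features_in_', None) or clf.n_features_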
    
    
    ## Now we add some extra code for interfacing with the function
    include = '#include <stdlib.h>\n#include <stdio.h>\n' \
              '#include <math.h>\n#include <string.h>\n'
              
    # create static variables and conversion from classidx to label
    definitions = 'const int n_classes = {};\nconst int n_features = {};\n' \
                  'const int labels[{}] = {{{}}};'.format(
                   n_classes, n_feats, n_classes, ','.join(labels))
    
    # add an argmax function to get from probabilities to classes
    # start from output[0] so the argmax is also correct when all scores are negative
    argmax = '''int argmax(double * output){{
    double max = output[0];
    int iargmax = 0;
    for (int i=1; i<{}; i++){{
        if (output[i]>max){{
            max = output[i];
            iargmax = i;
        }}
    }}
    return iargmax;\n}}\n'''.format(n_classes)
        

    predict = '''int predict (double features[]) {{
    double * output = predict_proba(features);
    int class_idx = argmax(output);
    int label = labels[class_idx];
    return label;\n}}\n'''.format()

    # the output buffer holds one value per class, not per feature
    predict_proba = '''double * predict_proba (double features[]) {{
    static double output[{}] = {{0}};
    score(features, output);
    return output;\n}}\n'''.format(n_classes)
    
    ## here we append a main() method so that we can receive the data from a cli
    main = '''int main(int argc, const char * argv[]) {{
    if (argc-1 != {}){{
            printf("Need to supply {} features, %d were given", argc-1);
            return 1;
        }}
    
    double features[argc-1];
    for (int i = 1; i < argc; i++) {{
        features[i-1] = atof(argv[i]);
    }}

    // calculate outputs for debugging
    double * output = predict_proba(features);
    // same as calling label = predict(features)
    int class_idx = argmax(output);
    int label = labels[class_idx];
    
    // now we print the results
    printf("labels: {}\\n");
    printf("probabilities: ");
    for (int i=0; i<{}; i++){{        
        printf("%f ", output[i]);
    }}
    printf("\\n");
    printf("class_idx: %d\\n", class_idx);
    printf("label: %d", label);
    return 0;\n}}'''.format(n_feats, n_feats, ', '.join(labels), n_classes)
      
    final_code = '\n'.join([include, definitions, code, argmax, 
                            predict_proba, predict, main])
    
    with open(file, 'w') as f:
        f.write(final_code)
        
    return True
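
The wrapper builds its C source with `str.format`, so every literal C brace in a template has to be doubled (`{{` and `}}`), while single braces mark the values to fill in. Below is a small stand-alone sketch of that technique. The helper `render_definitions` is hypothetical and reproduces only the `definitions` part of the wrapper, assuming integer labels.

```python
def render_definitions(labels, n_features):
    """Render the static C definitions; labels are assumed to be integers."""
    template = (
        'const int n_classes = {n_classes};\n'
        'const int n_features = {n_features};\n'
        # '{{' and '}}' become literal '{' and '}' in the C initializer.
        'const int labels[{n_classes}] = {{{values}}};'
    )
    return template.format(
        n_classes=len(labels),
        n_features=n_features,
        values=','.join(str(int(label)) for label in labels),
    )


print(render_definitions([0, 2, 3, 4, 5], n_features=1))
```

Named placeholders are used here instead of positional ones so that a template and its arguments cannot silently drift apart.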

1 reaction

izeigerman commented, Mar 15, 2019

@skjerns thanks for reporting this issue! This is indeed an interesting use case. In this scenario the code generated by m2cgen produces an array of class probabilities in which each class is represented by its index in the original model object (its position in clf.classes_), not by its label.

We should think about how to address this properly. Meanwhile, I can suggest the following steps as a workaround:

Once you have generated the code, you can manually append a list of labels to it, like this:

clf = RandomForestClassifier().fit(x_train, y_train)

# convert it
code = m2cgen.export_to_c(clf)

code += '\n'
code += 'const char *LABELS[] = { %s };' % ', '.join(['"' + str(c) + '"' for c in clf.classes_])

Somewhere in your C code you can use extern to link that constant:

extern const char *LABELS[];

Now you can use the index obtained from the output of the score function to access the corresponding label, e.g.:

const char *cls = LABELS[<index_from_score>];
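
The label-appending step can also be wrapped in a small helper. This is a sketch only: `c_labels_declaration` is a hypothetical name, and the escaping of backslashes and double quotes is an added assumption for string labels, not something m2cgen does for you.

```python
def c_labels_declaration(classes):
    """Build a C declaration mapping score() indexes to label strings."""
    def c_string(value):
        # Escape backslashes and double quotes so the literal stays valid C.
        text = str(value).replace('\\', '\\\\').replace('"', '\\"')
        return '"' + text + '"'

    return 'const char *LABELS[] = { %s };' % ', '.join(c_string(c) for c in classes)


# Classes as in the issue: label 1 never appeared in training.
declaration = c_labels_declaration([0, 2, 3, 4, 5])
print(declaration)
# const char *LABELS[] = { "0", "2", "3", "4", "5" };
```

In practice the argument would be `clf.classes_`, and the result would be appended to the string returned by `m2cgen.export_to_c`.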

Please let me know if the proposed solution worked for you.
