What is the right approach to handling categorical features with LightFM's Dataset?

With a data frame like this:

user_features = [{"user_id":"1","user_city":"X"},
{"user_id":"2","user_city":"Y"},
{"user_id":"3","user_city":"Z"}]

To use the user_city feature with LightFM's Dataset, there are two options:

Method 1: Treat each value X, Y, Z as an independent feature (one-hot encoding) and feed these into the dataset, as described in the LightFM docs:

dataset.fit( ..., user_features=['X', 'Y', 'Z'])
user_features = dataset.build_user_features( [('1', ['X']), ('2', ['Y']), ('3', ['Z'])] )

Method 2: Encode X as 1, Y as 2, Z as 3 and pass them as weights of a single feature, user_city:

dataset.fit( ..., user_features=['user_city'])
user_features = dataset.build_user_features( [('1', {'user_city': 1}), ('2', {'user_city': 2}), ('3', {'user_city': 3})] )
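For Method 1, the feature vocabulary and the per-user tuples can be derived from the records rather than typed by hand. The following is a minimal sketch, not a definitive implementation: the item ids are hypothetical placeholders (the question elides the other fit arguments), and the lightfm import is guarded only so that the data-preparation part runs without the library. Note that, by default, Dataset also creates one identity feature per user, so the resulting matrix has more columns than there are cities.

```python
# Records in the shape used in the question.
records = [
    {"user_id": "1", "user_city": "X"},
    {"user_id": "2", "user_city": "Y"},
    {"user_id": "3", "user_city": "Z"},
]

# Feature vocabulary: one feature name per distinct city (one-hot encoding).
city_features = sorted({r["user_city"] for r in records})

# One (user id, [feature names]) tuple per user, the input format
# accepted by Dataset.build_user_features.
user_feature_tuples = [(r["user_id"], [r["user_city"]]) for r in records]

try:
    from lightfm.data import Dataset
except ImportError:  # lightfm is optional for the preparation step above
    Dataset = None

if Dataset is not None:
    dataset = Dataset()
    dataset.fit(
        users=[r["user_id"] for r in records],
        items=["item_a", "item_b"],  # hypothetical item ids, for illustration only
        user_features=city_features,
    )
    # Sparse matrix with one row per user; columns cover the per-user
    # identity features plus the city features.
    user_features = dataset.build_user_features(user_feature_tuples)
    print(user_features.shape)
```

With many distinct cities, the matrix stays sparse: each user row has only a couple of non-zero entries, even though the number of columns grows.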

In my case, each feature can have many distinct values, so Method 1 would create a huge feature matrix and slow down computation. On the other hand, I don't know how the LightFM model would handle the features under Method 2: it is not recommended in the documentation, and it might reduce model performance or accuracy.

So which approach should I choose?

Top GitHub Comments

DoronGi commented, Sep 5, 2018

I can understand why you would want to avoid Method 1: it would not only slow down computation, but unless you have plenty of users in each city, your model won't be able to learn a meaningful representation for most city features. However, I don't think Method 2 makes any sense. As far as I understand, LightFM constructs the representation of each user as a linear combination of the representations of its features, and you are suggesting a single user_city feature whose weight differs from city to city. If you assign numbers to cities arbitrarily, I can't see how that would help your model understand the user better.
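To make the linear-combination argument concrete, here is a minimal sketch in plain Python. The embedding values are invented for illustration (real ones would be learned by the model), and the sketch ignores LightFM's per-user identity features, biases, and row normalization, so it is an illustration of the idea and not LightFM's actual implementation.

```python
# Hypothetical 2-dimensional embeddings; real values would be learned.
one_hot_embeddings = {"X": [0.75, -0.25], "Y": [-0.5, 0.5], "Z": [0.25, 0.25]}
shared_embedding = {"user_city": [0.5, -0.25]}


def combine(weighted_features, embeddings):
    """Return the weighted sum of the embeddings of the given features."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for name, weight in weighted_features.items():
        for k, value in enumerate(embeddings[name]):
            total[k] += weight * value
    return total


# Method 1: each city has its own embedding, so users in different
# cities can end up anywhere in the latent space.
method1 = {city: combine({city: 1.0}, one_hot_embeddings) for city in "XYZ"}

# Method 2: every city reuses the same embedding, scaled by an arbitrary
# integer code, so all user vectors lie on a single line.
codes = {"X": 1, "Y": 2, "Z": 3}
method2 = {
    city: combine({"user_city": code}, shared_embedding)
    for city, code in codes.items()
}

print(method1)
print(method2)
```

Under Method 2, the vector for city Z is just three times the vector for city X, which encodes an ordering and a magnitude that the arbitrary codes do not actually carry.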

You could however try different approaches:

Define a few city features that you think are relevant to your problem (e.g. big_metropolis, small_village, coastal, etc.) and map each city to the corresponding features (one-hot encoded).
Define even fewer city features (e.g. big_city) and map each city to a weight for each relevant feature, such that New York would get a higher big_city weight than Seattle.
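The two approaches above could be prepared along the following lines. This is a sketch under stated assumptions: the attribute table, its feature names, and its weights are all invented for the placeholder cities X, Y, and Z, and would have to come from real knowledge about the cities.

```python
# Hypothetical attribute table: every feature name and weight here is
# invented for illustration.
city_attributes = {
    "X": {"big_city": 1.0, "coastal": 1.0},
    "Y": {"big_city": 0.4},
    "Z": {"coastal": 1.0},
}

records = [
    {"user_id": "1", "user_city": "X"},
    {"user_id": "2", "user_city": "Y"},
    {"user_id": "3", "user_city": "Z"},
]

# Small, shared feature vocabulary instead of one feature per city.
feature_names = sorted(
    {name for attrs in city_attributes.values() for name in attrs}
)

# One (user id, {feature name: weight}) tuple per user.
user_feature_tuples = [
    (r["user_id"], dict(city_attributes[r["user_city"]])) for r in records
]

print(feature_names)
print(user_feature_tuples)
```

Here feature_names would be passed as user_features to dataset.fit, and user_feature_tuples to dataset.build_user_features. By default build_user_features normalizes each user's weights to sum to 1; passing normalize=False keeps the raw weights, which may matter if the relative magnitudes across users are meant to be informative.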

Keep in mind that the model can only be as good as the data you provide.

SimonCW commented, Jan 23, 2021

I’m closing this issue because it has been inactive for a long time. If you still encounter the problem, please open a new issue.

Thank you!
