What is the right approach to handle categorical features with the Dataset class of LightFM?
With a data frame like this:
user_features = [{"user_id":"1","user_city":"X"},
{"user_id":"2","user_city":"Y"},
{"user_id":"3","user_city":"Z"}]
To use the user_city feature with the Dataset class of lightfm, we have 2 ways:
- Method1: Set each of the values X, Y, Z as an independent feature (one-hot encoding) and feed them into the dataset, as written in the lightfm docs:
dataset.fit( ..., user_features=['X', 'Y', 'Z'])
user_features = dataset.build_user_features( [('1', ['X']), ('2', ['Y']), ('3', ['Z'])] )
- Method2: Encode X as 1, Y as 2, Z as 3 and fit only one feature, user_city:
dataset.fit( ..., user_features=['user_city'])
user_features = dataset.build_user_features( [('1', {'user_city': 1}), ('2', {'user_city': 2}), ('3', {'user_city': 3})] )
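For reference, both input shapes can be derived from the list of records above with plain Python. This is only a sketch: the helper names one_hot_inputs and integer_coded_inputs are hypothetical, and the returned values are meant to be passed to dataset.fit(..., user_features=...) and dataset.build_user_features(...) as in the two methods.

```python
records = [
    {"user_id": "1", "user_city": "X"},
    {"user_id": "2", "user_city": "Y"},
    {"user_id": "3", "user_city": "Z"},
]


def one_hot_inputs(records, column):
    """Method1: every distinct value becomes its own feature name."""
    vocabulary = sorted({r[column] for r in records})
    per_user = [(r["user_id"], [r[column]]) for r in records]
    return vocabulary, per_user


def integer_coded_inputs(records, column):
    """Method2: a single feature name, with an arbitrary integer code as its weight."""
    distinct = sorted({r[column] for r in records})
    codes = {value: i for i, value in enumerate(distinct, start=1)}
    per_user = [(r["user_id"], {column: codes[r[column]]}) for r in records]
    return [column], per_user


# First element goes to dataset.fit(..., user_features=...),
# second element goes to dataset.build_user_features(...).
m1_vocab, m1_users = one_hot_inputs(records, "user_city")
m2_vocab, m2_users = integer_coded_inputs(records, "user_city")
print(m1_vocab, m1_users)
print(m2_vocab, m2_users)
```

With Method1 the feature vocabulary grows with the number of distinct values. With Method2 it stays at one entry, but the integer codes are then passed in the position where the dict form expects a feature weight.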
In my problem, each feature can have many distinct values, so Method1 can create a huge matrix and slow down computation. Also, I don't know how the LightFM model will handle the additional features with Method2; because it is not recommended in the documentation, it may reduce model performance or accuracy.
So which approach should I choose?
Top GitHub Comments
I can understand why you would want to avoid method1 (it wouldn't only slow down computation; unless you have plenty of users in each city, your model won't be able to learn a significant representation for most city features), but I don't think method2 makes any sense. As far as I understand, LightFM will construct the representation for each user as a linear combination of its features' representations, and you suggest using a different weight for the user_city feature for each city. If you arbitrarily assign numbers to cities, I can't see how that can help your model understand the user better.
You could, however, try different approaches:
- Define a few city features which you think are relevant to your problem (e.g. big_metropolis, small_village, coastal, etc.) and map each city to the corresponding feature (one-hot encoded).
- Define even fewer city features (e.g. big_city) and map each city to a weight of the relevant features, such that New York would get a higher big_city weight than Seattle.
Keep in mind that the model can only be as good as the data you provide.
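A minimal sketch of the second suggestion, in plain Python. The city names, feature names, weights, and two-dimensional embeddings below are made-up values for illustration, not output from LightFM. The tuples in user_feature_rows follow the (user id, {feature name: feature weight}) form that Dataset.build_user_features accepts. The last part mimics the linear combination described above, to show how a weight scales a feature's contribution.

```python
# Hypothetical mapping from each city to weighted coarse features.
CITY_FEATURES = {
    "New York": {"big_city": 1.0, "coastal": 1.0},
    "Seattle": {"big_city": 0.5, "coastal": 1.0},
    "Smallville": {"small_village": 1.0},
}

users = [("1", "New York"), ("2", "Seattle"), ("3", "Smallville")]

# Feature vocabulary, intended for dataset.fit(..., user_features=feature_names).
feature_names = sorted({f for feats in CITY_FEATURES.values() for f in feats})

# Rows intended for dataset.build_user_features(user_feature_rows).
user_feature_rows = [(uid, dict(CITY_FEATURES[city])) for uid, city in users]

# Made-up feature embeddings, standing in for what a trained model would learn.
EMBEDDINGS = {
    "big_city": [1.0, 0.0],
    "coastal": [0.0, 1.0],
    "small_village": [-1.0, 0.0],
}


def user_representation(weights):
    """Weighted sum of feature embeddings (a linear combination)."""
    dim = len(next(iter(EMBEDDINGS.values())))
    vec = [0.0] * dim
    for name, weight in weights.items():
        for i, x in enumerate(EMBEDDINGS[name]):
            vec[i] += weight * x
    return vec


for uid, weights in user_feature_rows:
    print(uid, user_representation(weights))
```

Here the New York user gets twice the big_city contribution of the Seattle user, which is a meaningful difference, unlike an arbitrary integer code. As far as I know, build_user_features normalizes each row's weights by default (normalize=True), so it is the relative weights within a row that matter.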
I’m closing this issue because it has been inactive for a long time. If you still encounter the problem, please open a new issue.
Thank you!