Average with negative weights
Problem description
In several contexts, we discussed how to compute weighted averages when the weights are negative. The prime use case for this is an average carbon price over several regions, weighted by the (possibly negative) emissions in each region.
The current implementation of `aggregate_region` computes a direct weighted sum:
`value * weight / sum(weight)`
This can lead to very counter-intuitive results - see more below.
Illustration
Use the following snippet to explore the current behaviour:
```python
import pandas as pd
import pyam

def test_aggregate(value, weight):
    TEST_DF = pd.DataFrame(
        [
            ['reg_a', 'Price|Carbon', 'USD/t CO2', value[0]],
            ['reg_b', 'Price|Carbon', 'USD/t CO2', value[1]],
            ['reg_a', 'Emissions|CO2', 'Mt CO2', weight[0]],
            ['reg_b', 'Emissions|CO2', 'Mt CO2', weight[1]],
        ],
        columns=['region', 'variable', 'unit', 2010],
    )
    return (
        pyam.IamDataFrame(TEST_DF, model='model_a', scenario='scen_a')
        .aggregate_region('Price|Carbon', weight='Emissions|CO2')
        ._data.iloc[0]
    )
```
- If the sum of the weights is zero, the resulting average price is infinite:
  `test_aggregate(value=[1, 2], weight=[-1, 1]) >>> inf`
- Depending on the order of the weight vector, it can also be negative infinity:
  `test_aggregate(value=[1, 2], weight=[1, -1]) >>> -inf`
- If `sum(weight) == 0` and `value * weight == 0`, the returned value is `nan`, which is dropped when initializing the IamDataFrame:
  `test_aggregate(value=[1, 1], weight=[-1, 1]) >>> Error`
- Other values can lead to situations where the "average" price is outside the range `[min(value), max(value)]`, e.g.:
  `test_aggregate(value=[1, 2], weight=[-1, 2]) >>> 3`
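The degenerate cases above can be reproduced without pyam; a minimal sketch of the same `value * weight / sum(weight)` expression in plain numpy (the helper name `weighted_mean` is illustrative, not pyam API):

```python
import numpy as np

def weighted_mean(value, weight):
    # same expression as the current aggregate_region: value * weight / sum(weight)
    value, weight = np.asarray(value, float), np.asarray(weight, float)
    with np.errstate(divide="ignore", invalid="ignore"):
        return (value * weight).sum() / weight.sum()

weighted_mean([1, 2], [-1, 1])   #  1 / 0 ->  inf
weighted_mean([1, 2], [1, -1])   # -1 / 0 -> -inf  (weight order flips the sign)
weighted_mean([1, 1], [-1, 1])   #  0 / 0 ->  nan
weighted_mean([1, 2], [-1, 2])   #  3 / 1 ->  3.0, outside [min(value), max(value)]
```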
Possible solutions
1. Leave as is, because it's up to the users to know this (including modeling teams uploading to the IIASA Scenario Explorer).
2. Leave as is, but write a warning message when weights are negative (the question is whether users would simply ignore it).
3. Raise an error when any weight is negative, forcing users to explicitly choose an alternative approach (average, min, max).
4. Use the absolute value of the weight vector. If there are three regions with emissions -1, 0 and 1, the prices of the first and third regions would be given equal weight and the second region would be ignored. This seems wrong…
5. Redefine the mathematical expression to `value * weight / sum(abs(weight))`. However, this can also lead to strange outcomes, e.g., the weighted average can differ from the regional values even when all of them are identical.
6. Apply normalization: recalculate the weight vector such that `min(weight) == 0` and `max(weight) == 1`, leaving the relative distribution as is. This means that the price of the region with the lowest emissions is ignored (because its weight is 0).
7. Apply a shift/offset to make all weights positive. Suggestion by @byersiiasa: `weight = weight - 2 * min(weight)`. The vector `weight=[-1, 2]` would become `[1, 4]`: `test_aggregate(value=[1, 2], weight=[1, 4]) >>> 1.8`. This has the drawback that the offset multiplier is arbitrary; using a multiplier of 3 instead of 2 gives a different result: `test_aggregate(value=[1, 2], weight=[2, 5]) >>> 1.7142`.
8. Added: Write a warning when weights are negative, and add a keyword argument whether to drop (default) or keep any values aggregated with negative weights.
9. Added: Same as 8, but the default is to include aggregation with negative weights.
10. Any other ideas…?
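To make the trade-offs concrete, the alternative weighting schemes in options 4-7 can be sketched side by side in plain numpy (illustrative helpers, not proposed pyam API; the constant `k` is the arbitrary offset multiplier from option 7):

```python
import numpy as np

def weighted_mean(value, weight):
    """Plain weighted mean: value * weight / sum(weight)."""
    value, weight = np.asarray(value, float), np.asarray(weight, float)
    return (value * weight).sum() / weight.sum()

def abs_weights(value, weight):
    # option 4: replace each weight by its absolute value
    return weighted_mean(value, np.abs(np.asarray(weight, float)))

def abs_normalized(value, weight):
    # option 5: normalize by sum(abs(weight)) instead of sum(weight)
    value, weight = np.asarray(value, float), np.asarray(weight, float)
    return (value * weight).sum() / np.abs(weight).sum()

def minmax_normalized(value, weight):
    # option 6: rescale weights so min(weight) == 0 and max(weight) == 1
    weight = np.asarray(weight, float)
    return weighted_mean(value, (weight - weight.min()) / (weight.max() - weight.min()))

def shifted(value, weight, k=2):
    # option 7: offset all weights by -k * min(weight) if any weight is negative
    weight = np.asarray(weight, float)
    if weight.min() < 0:
        weight = weight - k * weight.min()
    return weighted_mean(value, weight)

abs_weights([10, 20, 30], [-1, 0, 1])    # 20.0 -- regions 1 and 3 equal, region 2 ignored
abs_normalized([5, 5], [-1, 1])          # 0.0 -- although every regional value is 5
minmax_normalized([1, 2], [-1, 1])       # 2.0 -- lowest-emission region ignored
shifted([1, 2], [-1, 2], k=2)            # 1.8
shifted([1, 2], [-1, 2], k=3)            # ~1.714 -- result depends on the arbitrary k
```

Each helper reproduces one of the counter-intuitive outcomes flagged above, which is why none of them is an obvious drop-in fix.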
Note
The function `numpy.average()` faces the same problem. The devs decided to leave the unintuitive behaviour (solution 1 above), see https://github.com/numpy/numpy/issues/9825.
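The numpy behaviour can be checked directly (as of current numpy versions; note that numpy raises on a zero weight sum rather than returning infinity):

```python
import numpy as np

# negative weights are accepted and can push the "average" outside the value range
np.average([1, 2], weights=[-1, 2])   # 3.0

# but numpy raises rather than returning inf when the weights sum to zero
try:
    np.average([1, 2], weights=[-1, 1])
except ZeroDivisionError as exc:
    print(exc)   # weights sum to zero
```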
Issue Analytics
- Created: 3 years ago
- Comments: 9 (3 by maintainers)

Principle of minimum surprise would suggest: don’t do something different than numpy, so certainly not (4) through (7). Definitely do not have different behaviours depending on the particular variable names in use, as suggested in the last two comments.
Of (1), (2), and (3), I would exclude (3), since the numpy thread points out this is mathematically valid, albeit perhaps without a common or obvious application in IA modeling. Between (1) and (2) it’s a toss-up and I think depends on the intended user base; if they’re likely to make this mistake and unlikely to read the docs carefully, then a warning could help.
ETA: on closer reading, the example above shows that `±Inf` is returned when weights sum to zero, whereas numpy raises `ZeroDivisionError`. Maybe consider removing this discrepancy.

I agree this all makes sense from a software perspective. How / where should we have the discussion regarding the actual implementation, to resolve the problematic issue of when the desirable weights, e.g. Emissions, go negative?