Modulo parameter constraints
See original GitHub issue1. modulo
hidden_size % num_attention_heads == 0
use case
In transformer models, we use multi-head attention which the hidden vector will be divided into n part, and n is number of attention heads, so we usually want hidden_size divisible by num_attention_heads
2. log2
math.log2(batch_size) % 1 == 0
use case
To make batch size be 2**n to just fit in memory.
3. Why?
Can we just pass a function as parameter constraint and have all parameters’ names as arguments?
Issue Analytics
- State:
- Created 4 years ago
- Reactions:1
- Comments:5 (4 by maintainers)
Top Results From Across the Web
Using modulo operator in constraint programming
Here eps is a parameter representing a positive value close to 0, e.g., 1e-4. If Z is integer then eps can be 1....
Read more >How to model a modulo with linear constraints?
In a MIP model, the constraint y=x mod a. can be written as x=k⋅a+yk integer 0≤y≤a−0.001. Here k is an extra integer variable....
Read more >Modulo operation Constraint? - Gurobi Support Portal
During the optimization process, Gurobi looks for values to assign to all of the model variables such that all of the model constraints...
Read more >Can a modulo operation be expressed as a ... - IBM Community
I need to add a constraint that decision variable with modulo expression.I noted that in Stackoverflow there is no direct method we cannot...
Read more >Can a modulo operation be expressed as a constraint in ...
In your specific case, the constraint got simplified because one of the rhs terms was 0. Instead, more generally, if you had b1...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Can you just reparameterize your problem in this case? Like define integer parameters
size_per_head
andlog_batch_size
and then in your evaluation code havehidden_size = size_per_head * num_attention_heads
batch_size = 2**log_batch_size
Or would that mean you need a joint constraint on
size_per_head
andnum_attention_heads
?@sdsingh From an optimization/candidate generation perspective (other than difficulty of the problem) there is no issue with imposing non-linear constraints on the parameter space. I assume the linearity assumption is mainly imposing structure for representing the constraints in Ax?
All parameter constraints are implemented as linear constraints in the modeling layer. When we pass points into our GPs, they are all normalized to [0,1]^d, where then our constraints are applied. You can see how these are implemented w/ a simple matrix multiply in the Botorch model. As a result, we can only support constraints that can be mapped into a linear constraint, however creatively that may be. Unfortunately, I don’t see a way of doing that for either of these constraint types.
Both of these constraint types seem pretty useful, though. Let me think about what we can do here, but unfortunately I don’t think this will be a quick fix.