Replace specific loops with Boolean masking and vectorized assignments #8045

mhucka wants to merge 4 commits into quantumlib:main
Conversation
This amounts to a simple bit of micro-optimization. With the help of Gemini, I looked for cases where some simple loops could be rewritten to use NumPy's `eye()` function where appropriate. There were a few cases, and that's what this PR is about.
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #8045      +/-   ##
==========================================
- Coverage   99.63%   99.63%   -0.01%
==========================================
  Files        1110     1110
  Lines       99795    99787       -8
==========================================
- Hits        99435    99425      -10
- Misses        360      362       +2
==========================================
```
```diff
  matrix = np.copy(matrix)
- for i in range(min(matrix.shape)):
-     matrix[i, i] = 0
+ matrix[np.eye(*matrix.shape, dtype=np.bool)] = 0
```
This changes the assignment cost from O(N) to O(N^2) and allocates an extra N^2-sized temporary matrix. Note that `matrix` gets copied one more time in `all_near_zero`.

Here is a version that does away with one matrix copy with respect to the initial code:

```python
absolute = np.abs(matrix)
np.fill_diagonal(absolute, 0)
return tolerance.near_zero(np.max(absolute, initial=0), atol=atol)
```

And here are timings for the original, PR, and suggestion:

```
In [1]: a = np.eye(2048, dtype=float)

In [2]: %timeit is_diagonal(a)
17.9 ms ± 113 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit is_diagonal1(a)
18.4 ms ± 21.8 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit is_diagonal2(a)
8.9 ms ± 38.2 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
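For reference, the three variants being timed can be sketched as self-contained functions. The names `is_diagonal_loop` / `is_diagonal_mask` / `is_diagonal_filldiag` are illustrative, the real code delegates the tolerance test to `tolerance.near_zero` (approximated here with a plain `atol` comparison), and the PR's `np.bool` is written as the builtin `bool` because that alias was removed in NumPy 1.24:

```python
import numpy as np

ATOL = 1e-8  # illustrative tolerance; the real code takes atol as a parameter


def is_diagonal_loop(matrix):
    # original: zero the diagonal with a Python loop, then test the rest
    m = np.copy(matrix)
    for i in range(min(m.shape)):
        m[i, i] = 0
    return bool(np.all(np.abs(m) <= ATOL))


def is_diagonal_mask(matrix):
    # PR version: the Boolean eye mask allocates an N^2 temporary
    m = np.copy(matrix)
    m[np.eye(*m.shape, dtype=bool)] = 0
    return bool(np.all(np.abs(m) <= ATOL))


def is_diagonal_filldiag(matrix):
    # suggested version: one copy via np.abs, O(N) diagonal clear
    absolute = np.abs(matrix)
    np.fill_diagonal(absolute, 0)
    return bool(np.max(absolute, initial=0) <= ATOL)


a = np.eye(4)
assert is_diagonal_loop(a) and is_diagonal_mask(a) and is_diagonal_filldiag(a)
b = np.ones((4, 4))
assert not (is_diagonal_loop(b) or is_diagonal_mask(b) or is_diagonal_filldiag(b))
```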
```diff
- for i in range(width):
-     result[(i + shift) % width, i] = 1
- return result
+ return np.roll(np.eye(width), shift, axis=0)
```
NVM, this is a test helper function, so its performance does not matter much.

This has to copy N^2 values (compared to N assignments before) and allocate 2 arrays instead of 1. It is about 6 times slower than the original. Here is a better way of vectorizing the Python loop:
```diff
- return np.roll(np.eye(width), shift, axis=0)
+ result = np.zeros(width * width)
+ shift = shift % width if width else 0
+ # lower diagonal starting at the first column
+ result[shift * width :: width + 1].fill(1)
+ # upper diagonal starting at the first row
+ if shift:
+     result[width - shift : shift * width : width + 1].fill(1)
+ return result.reshape((width, width))
```
Again, timings for the original, PR, and suggestion:

```
%timeit shift_matrix(8, 3)
1.46 μs ± 66.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)  # original
8.13 μs ± 305 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)     # PR
1.1 μs ± 26.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)   # suggestion

%timeit shift_matrix(1024, 3)
475 μs ± 4.82 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)       # original
3.69 ms ± 8.34 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)        # PR
298 μs ± 1.85 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)       # suggestion
```
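To see that the strided-slice version produces the same matrix as the loop and the `np.roll` form, the three variants can be sketched as standalone functions (the `_loop` / `_roll` / `_strided` names are illustrative, not the names in the codebase):

```python
import numpy as np


def shift_matrix_loop(width, shift):
    # original: N single-element assignments
    result = np.zeros((width, width))
    for i in range(width):
        result[(i + shift) % width, i] = 1
    return result


def shift_matrix_roll(width, shift):
    # PR version: copies N^2 values and allocates two arrays
    return np.roll(np.eye(width), shift, axis=0)


def shift_matrix_strided(width, shift):
    # suggested version: fill the two wrapped diagonals via strided slices
    result = np.zeros(width * width)
    shift = shift % width if width else 0
    # lower diagonal starting at the first column
    result[shift * width :: width + 1].fill(1)
    # upper diagonal starting at the first row
    if shift:
        result[width - shift : shift * width : width + 1].fill(1)
    return result.reshape((width, width))


for w, s in [(1, 0), (4, 0), (8, 3), (5, 7)]:
    expected = shift_matrix_loop(w, s)
    assert np.array_equal(shift_matrix_roll(w, s), expected)
    assert np.array_equal(shift_matrix_strided(w, s), expected)
```

The strided trick works because in the flattened row-major array a step of `width + 1` moves one row down and one column right, so a single slice covers each wrapped diagonal band.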
```python
N = 2**self._num_qubits
return self._state[None, :, None] * np.eye(N, dtype=self._state.dtype)[:, None, :]
```
This allocates an extra temporary matrix of N^2 elements which gets multiplied N times. The overall complexity is increased from O(N^2) to O(N^3), with over 50% slowdown.

Here is a version which is comparable to, and in some cases faster than, the original, depending on `num_qubits`:

```python
N = 2**self._num_qubits
operator = np.zeros(shape=(N, N, N), dtype=self._state.dtype)
idx = np.arange(N)
operator[idx, :, idx] = self._state
return operator
```

Here are timings for the original, PR, and suggestion:
```
# num_qubits = 2, N = 4
2.17 μs ± 67.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)  # original
3.51 μs ± 96.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)  # PR
2.29 μs ± 43.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)  # suggestion

# num_qubits = 4, N = 16
6.95 μs ± 189 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)   # original
10.8 μs ± 114 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)   # PR
3.48 μs ± 206 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)   # suggestion
```
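A quick standalone check that the broadcast form and the fancy-index form build the same tensor (using a small `state` vector as a hypothetical stand-in for `self._state`): both produce a 3-D array where slice `[i, :, k]` equals `state` when `i == k` and is zero otherwise.

```python
import numpy as np

num_qubits = 2
N = 2**num_qubits
state = np.arange(N, dtype=np.complex128)  # hypothetical stand-in for self._state

# PR version: broadcasting against eye builds an N^2 temporary,
# and the elementwise multiply does O(N^3) work
op_broadcast = state[None, :, None] * np.eye(N, dtype=state.dtype)[:, None, :]

# suggested version: O(N^2) fancy-index assignment into a zeroed tensor;
# pairing idx on axes 0 and 2 writes state into each slice operator[i, :, i]
op_indexed = np.zeros(shape=(N, N, N), dtype=state.dtype)
idx = np.arange(N)
op_indexed[idx, :, idx] = state

assert op_broadcast.shape == (N, N, N)
assert np.array_equal(op_broadcast, op_indexed)
```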
```diff
  new_xs = np.zeros((2 * self.n + 1, self.n), dtype=bool)
- for i in range(self.n):
-     new_xs[i, i] = True
+ new_xs[: self.n, : self.n] = np.eye(self.n, dtype=bool)
```
This increases the number of assignments from N to N^2, but it still runs faster for a bit array. That said, `np.fill_diagonal` does the job with O(N) complexity:

```python
np.fill_diagonal(new_xs[: self.n, :], True)
```

Timings for 20 qubits for the original, PR, and suggestion:

```
4.01 μs ± 98.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
3.5 μs ± 29 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.57 μs ± 7.33 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
```
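The three forms are easy to verify as equivalent on a small array (`n = 4` here is an arbitrary illustrative size; `np.fill_diagonal` works in place because a basic slice of a NumPy array is a view):

```python
import numpy as np

n = 4  # hypothetical small qubit count

# original: Python loop, N single-bit assignments
xs_loop = np.zeros((2 * n + 1, n), dtype=bool)
for i in range(n):
    xs_loop[i, i] = True

# PR version: copies an N x N eye block, N^2 assignments
xs_eye = np.zeros((2 * n + 1, n), dtype=bool)
xs_eye[:n, :n] = np.eye(n, dtype=bool)

# suggested version: writes only the N diagonal entries of the view
xs_fill = np.zeros((2 * n + 1, n), dtype=bool)
np.fill_diagonal(xs_fill[:n, :], True)

assert np.array_equal(xs_loop, xs_eye)
assert np.array_equal(xs_loop, xs_fill)
```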
```diff
  new_zs = np.zeros((2 * self.n + 1, self.n), dtype=bool)
- for i in range(self.n):
-     new_zs[self.n + i, i] = True
+ new_zs[self.n : 2 * self.n, : self.n] = np.eye(self.n, dtype=bool)
```
```diff
- new_zs[self.n : 2 * self.n, : self.n] = np.eye(self.n, dtype=bool)
+ np.fill_diagonal(new_zs[self.n : 2 * self.n, :], True)
```
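The same check works for the offset block: filling the diagonal of the view `new_zs[n : 2n, :]` sets exactly the entries `(n + i, i)` that the original loop wrote (again with an arbitrary small `n` for illustration):

```python
import numpy as np

n = 4  # hypothetical small qubit count

# original: loop writes entries (n + i, i)
zs_loop = np.zeros((2 * n + 1, n), dtype=bool)
for i in range(n):
    zs_loop[n + i, i] = True

# suggested version: fill the diagonal of the offset view in place
zs_fill = np.zeros((2 * n + 1, n), dtype=bool)
np.fill_diagonal(zs_fill[n : 2 * n, :], True)

assert np.array_equal(zs_loop, zs_fill)
```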
pavoljuhas left a comment
LLMs seem very capable of making suggestions that look good at first sight but in reality make things worse. We should be careful to quantify any supposed code optimizations and only accept them if they are truly a change for the better.

Before doing such timings, we should also consider whether the modified code is speed-critical. If not, the changes are probably not worth the effort, unless they make the code clearly easier to read, which again is not much of a strength of LLMs.