NumExpr is faster than NumPy, and you’ve probably never heard of it
I recently came across a post comparing NumExpr to NumPy.
Yep, it’s fast. Not “we shaved 10% off by refactoring” fast - but “2-15x faster than NumPy” fast. And no, it’s nothing new.
I originally discovered NumExpr while digging through Pandas source code. It’s been around since 2007 - older than Instagram, Uber, and most of the JavaScript frameworks that have already died and been forgotten.
Here’s the kicker: if you’ve ever used df.eval() or df.query() in Pandas, you’ve used NumExpr under the hood. It’s been hiding in plain sight, quietly powering one of the most popular data manipulation libraries in Python.
So why isn’t everyone using NumExpr? And more interestingly - how does a library from 2007 still outperform modern alternatives?
The problem NumExpr solves
Let’s start with a simple NumPy expression:
import numpy as np
a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
result = 2*a + 3*b

This looks clean. It’s vectorized. It’s exactly what NumPy was designed for. But under the hood, NumPy is doing something wasteful:
- Calculate 2*a → creates temporary array #1
- Calculate 3*b → creates temporary array #2
- Add them together → creates temporary array #3 (result)
For 10 million elements of float64, that’s:
- 10M * 8 bytes * 3 arrays = 240 MB of memory allocations
- Three separate passes through memory
That’s wasteful, sure, but more importantly, it’s slow. Modern CPUs are fast at computation but surprisingly slow at memory access. My R5500 is quite fast at roughly 36.8 GFLOPS, but if it’s constantly waiting for data from RAM, it’s spending most of its time idle.
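For context, you can get partway there in plain NumPy by hand, using the out= argument of ufuncs. A minimal sketch (the buffer names are just illustrative):
import numpy as np
a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
# Preallocate two buffers once and reuse them
result = np.empty_like(a)
tmp = np.empty_like(b)
np.multiply(a, 2.0, out=result)  # result = 2*a, no fresh allocation
np.multiply(b, 3.0, out=tmp)     # tmp = 3*b, reuses the buffer
np.add(result, tmp, out=result)  # result = 2*a + 3*b, in place
That removes the allocations, but it still makes three separate passes over 80 MB arrays.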
This is where NumExpr comes in:
import numexpr as ne
result = ne.evaluate("2*a + 3*b")

Same result. But instead of three temporary arrays, NumExpr uses zero. It processes the expression in chunks small enough to fit in CPU cache, one chunk at a time.
The performance difference?
import numpy as np
import numexpr as ne
import timeit
ne.set_num_threads(1)
a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
t_numpy = timeit.timeit(
stmt="2*a + 3*b",
globals=globals(),
number=10
)
t_numexpr = timeit.timeit(
stmt='ne.evaluate("2*a + 3*b")',
globals=globals(),
number=10
)
print(f"NumPy: {t_numpy / 10:.6f} s per run")
print(f"NumExpr: {t_numexpr / 10:.6f} s per run")
#NumPy: 0.045659 s per run
#NumExpr: 0.026854 s per run

It’s almost twice as fast - and note that the benchmark pins NumExpr to a single thread with ne.set_num_threads(1), so the win comes from memory behavior alone, not parallelism.
How NumExpr actually works
The clever bit isn’t just avoiding temporary arrays. NumExpr does something genuinely interesting with how it processes expressions.
When you call ne.evaluate("2*a + 3*b"), here’s what happens:
Step 1: Compile to bytecode
NumExpr takes the string expression and compiles it - not to Python bytecode, but to its own instruction set. Think of it as a tiny VM just for array operations. This compilation happens only the first time you use an expression; the result is cached after that.
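You can see the one-time compilation cost yourself by timing the first call against a repeat call. A quick sketch:
import time
import numpy as np
import numexpr as ne
a = np.random.rand(1000)
b = np.random.rand(1000)
t0 = time.perf_counter()
ne.evaluate("2*a + 3*b")  # first call: parse, compile, cache, run
t1 = time.perf_counter()
ne.evaluate("2*a + 3*b")  # second call: cache hit, just run
t2 = time.perf_counter()
print(f"first call:  {t1 - t0:.6f} s")
print(f"second call: {t2 - t1:.6f} s")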
Step 2: Chunk the arrays
Instead of processing all 10 million elements at once, NumExpr splits arrays into chunks of 4,096 elements (by default). This isn’t arbitrary - 4,096 * 8 bytes = 32 KB, which fits comfortably in a typical CPU’s L1 cache.
Step 3: Process chunks through VM
For each chunk:
Load a[0:4096] into register r0
Load b[0:4096] into register r1
Multiply r0 by 2 → r2
Multiply r1 by 3 → r3
Add r2 and r3 → r4
Store r4 to result[0:4096]

Then move to the next chunk. Rinse and repeat.
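Sketched in pure Python, the access pattern looks like this (illustrative only - the real VM does this in C with registers, not NumPy slices):
import numpy as np

def chunked_evaluate(a, b, chunk_size=4096):
    # Illustrative chunked version of 2*a + 3*b
    result = np.empty_like(a)
    for start in range(0, len(a), chunk_size):
        stop = start + chunk_size
        # Each float64 slice is ~32 KB, so it stays in L1 cache
        # while all three operations run over it
        result[start:stop] = 2*a[start:stop] + 3*b[start:stop]
    return result
Don’t actually use this - the Python loop overhead makes it slower than plain NumPy. It only shows where the cache win comes from.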
Step 4: Multi-thread across chunks
Here’s the interesting part. Each chunk is independent, so NumExpr can process multiple chunks in parallel across CPU cores. The VM is written in C and releases Python’s GIL, so a simple expression like 2*a + 3*b can scale almost linearly with core count (at least until you hit memory bandwidth limits).
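The thread count is under your control at runtime:
import numexpr as ne
print(ne.detect_number_of_cores())  # how many cores NumExpr can see
prev = ne.set_num_threads(4)        # set the pool size; returns the old value
NumExpr also respects the NUMEXPR_MAX_THREADS environment variable if you’d rather configure it from outside.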
Why NumPy can’t do this
So why doesn’t NumPy just implement the same optimizations?
The answer is backwards compatibility and API design. NumPy’s whole selling point is that operations on arrays look like operations on scalars. You can write 2*a and get back an array. This is incredibly convenient, but it means each operation has to materialize a complete result.
NumExpr requires you to express your entire computation as a string: "2*a + 3*b". This looks weird at first, but it’s necessary for the optimization to work. The library needs to see the whole expression to avoid creating intermediates.
Pandas bridges this gap nicely with eval():
df.eval('C = 2*A + 3*B')

This is still a string expression, but it stays inside the Pandas API: you get the performance without ever touching NumExpr directly.
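Both eval() and query() also take an engine argument if you want to be explicit. A small sketch with made-up column names:
import numpy as np
import pandas as pd
df = pd.DataFrame({"A": np.random.rand(1_000_000),
                   "B": np.random.rand(1_000_000)})
# engine="numexpr" is the default whenever NumExpr is installed
df.eval("C = 2*A + 3*B", engine="numexpr", inplace=True)
subset = df.query("A > 0.5 and B < 0.5", engine="numexpr")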
When NumExpr actually matters
1. Large arrays that don’t fit in cache
With small arrays (< 100K elements), NumPy is often faster. The overhead of NumExpr’s VM setup eats into the gains. But once your arrays are larger than CPU cache, NumExpr wins.
# small arrays
a = np.random.rand(1000)
b = np.random.rand(1000)
# NumPy: 0.000009 s per run
# NumExpr: 0.000075 s per run; actually slower
# large arrays
a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
# NumPy: 0.048272 s per run
# NumExpr: 0.026165 s per run; faster2. Complex expressions
The more operations in your expression, the more temporary arrays NumPy creates, and the bigger the NumExpr advantage:
# simple expression
%timeit ne.evaluate("2*a + 3*b") # ~2x speedup
# complex expression
%timeit ne.evaluate("a*b - 4.1*a > 2.5*b") # 2.31x speedup

3. Broadcasting operations
NumPy broadcasting is elegant but creates temporary arrays. NumExpr broadcasts on-the-fly:
a = np.arange(1000)
b = np.arange(1_000_000).reshape(1000, 1000)
stmt = "a * (b + 1)"
t_numpy = timeit.timeit(stmt=stmt, globals=globals(), number=10)
t_numexpr = timeit.timeit(stmt="ne.evaluate(stmt)", globals=globals(), number=10)
print(f"NumPy: {t_numpy / 10:.6f} s per run")
print(f"NumExpr: {t_numexpr / 10:.6f} s per run")
# NumPy: 0.003560 s per run
# NumExpr: 0.002834 s per run; faster

The Polars question
Okay, but what about Polars? That’s the new hot thing for DataFrame operations.
Polars is indeed fast - it’s written in Rust, uses Apache Arrow, and has a query optimizer that can push down predicates and eliminate intermediate allocations. For DataFrame operations, it’s often faster than Pandas + NumExpr. But it’s like comparing apples to oranges.
Polars and NumExpr solve different problems:
- Polars: Optimized for structural operations (filtering, grouping, joining DataFrames)
- NumExpr: Optimized for numerical operations (element-wise math on arrays)
If you’re doing df.filter(pl.col("age") > 25).group_by("city").agg(...), use Polars. If you’re doing df["result"] = 2*df["a"]**2 + 3*df["b"]**2 - 4*df["c"], NumExpr will often be faster.
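And the two compose. For the element-wise case above, a minimal sketch is to hand NumExpr the underlying arrays and assign the result back (assuming a Pandas DataFrame with numeric columns a, b, c):
import numexpr as ne
# Pull out the raw NumPy arrays, let NumExpr do the math in one pass
a = df["a"].to_numpy()
b = df["b"].to_numpy()
c = df["c"].to_numpy()
df["result"] = ne.evaluate("2*a**2 + 3*b**2 - 4*c")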
Why NumExpr is still relevant in 2025
A library from 2007 shouldn’t still be competitive with modern tools. The reason NumExpr is still relevant is simple: the hardware realities it exploits haven’t changed much:
- CPU cache sizes are the same - L1 cache is still ~32-64 KB per core
- Memory latency is still the bottleneck - RAM hasn’t gotten faster relative to CPU
- Multi-core is standard - but Python’s GIL still blocks most threading, though this may be changing fast
NumExpr’s core insight - process small chunks that fit in cache, across multiple threads - is just as valid today as it was in 2007. Memory hierarchy hasn’t changed, even if everything else has.
Meanwhile, newer tools like Polars optimize for different things: zero-copy operations, SIMD vectorization, query planning. These are orthogonal to what NumExpr does.
Should you use NumExpr?
Here’s my heuristic:
Use NumExpr when:
- Working with large arrays (> 1M elements)
- Complex numerical expressions (more than 2 operations)
- Memory is constrained
- You’re already in NumPy ecosystem
Don’t bother when:
- Arrays are small (< 100K elements)
- Simple operations (a + b)
- Using Pandas/Polars for structural operations
- Expressions change frequently (compilation overhead)
Use Pandas eval/query instead when:
- Working with DataFrames (not raw arrays)
- Want cleaner syntax than string expressions
- Already using Pandas
The nice thing about NumExpr is you don’t have to go all-in. It’s easy to sprinkle into existing NumPy code:
# before
result = 2*a**2 + 3*b**2 - 4*c
# after
import numexpr as ne
result = ne.evaluate("2*a**2 + 3*b**2 - 4*c")

Same result, 4x faster, minimal code changes.
The Intel MKL (and AMD) quirks
NumExpr can optionally use Intel’s Math Kernel Library (MKL) for transcendental functions like sin(), exp(), log().
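For example (sin and exp are among NumExpr’s supported functions):
import numpy as np
import numexpr as ne
a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)
# NumPy: separate sin and exp passes, plus temporaries
r_np = np.sin(a) + np.exp(b)
# NumExpr: one fused pass; with an MKL build, sin/exp go through VML
r_ne = ne.evaluate("sin(a) + exp(b)")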
With MKL, expressions like sin(a) + exp(b) can be 15x faster than NumPy. Without MKL, the speedup is more modest (2-4x). I’m using an AMD R5500, and MKL has historically picked slower code paths on non-Intel CPUs, so it won’t be squeezing everything out of my Zen 3 hardware. You can try an old trick to unlock a performance boost:
# force MKL to use AVX2 optimizations on AMD Zen chips
export MKL_DEBUG_CPU_TYPE=5

(Intel removed this override in MKL 2020 Update 1, so it only helps on older MKL builds.) Installation matters too: conda builds of NumExpr typically include MKL on x86, but pip installs don’t. It’s one of the annoying asymmetries in Python’s numerical stack, where performance depends on how you installed things and what hardware you’re running on.
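As far as I know, you can check which kind of build you have from the module itself:
import numexpr as ne
print(ne.use_vml)  # True if this build was compiled against MKL/VML
if ne.use_vml:
    print(ne.get_vml_version())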
Closing thoughts
NumExpr is one of those libraries that makes you realize how much low-hanging performance fruit exists in typical Python code. We often assume NumPy is “as fast as it gets” for array operations, but that’s only true if you ignore memory access patterns.
The broader lesson is that performance optimization often comes down to working with hardware realities rather than against them. NumExpr doesn’t use fancy new CPU instructions or exotic algorithms. It just:
- Avoids unnecessary memory allocations
- Keeps data in cache when possible
- Uses multiple cores effectively
These principles applied in 2007, and they still apply today.
Next time you’re profiling a data pipeline and see NumPy operations as the bottleneck, consider giving NumExpr a try.
And if you’re a heavy Pandas user, you’re probably already using it without knowing. Maybe it’s time to use it intentionally.
Footnotes:
- Pandas uses NumExpr automatically for eval() and query() operations when it is installed. It’s an optional (recommended) dependency of Pandas, so a minimal install may not include it.
- The 4,096 element chunk size is an internal default rather than something you normally tune, and it works well for most cases.
- The “15x faster” claim refers to specific operations (complex expressions with transcendental functions on large arrays). Typical speedups are 2-4x for most operations.
- NumExpr’s virtual machine is written entirely in C and releases Python’s GIL during multi-threaded execution. This is why it can effectively use all CPU cores, unlike most Python code.