# CPU Backend
The CPU backend is the default and most mature backend in Tensor Frame. It provides reliable tensor operations using system memory and CPU cores, with parallelization via the Rayon library.
## Features
- Always Available: No additional dependencies required
- Parallel Processing: Multi-threaded operations via Rayon
- Full API Support: All tensor operations implemented
- Memory Efficient: Direct `Vec<f32>` storage without additional overhead
- Debugging Friendly: Easy inspection with standard debugging tools
## Configuration
The CPU backend is enabled by default:
```toml
[dependencies]
tensor_frame = "0.0.1-alpha"  # CPU backend included
```
Or explicitly:
```toml
[dependencies]
tensor_frame = { version = "0.0.1-alpha", features = ["cpu"] }
```
## Implementation Details
### Storage
CPU tensors use a standard Rust `Vec<f32>` for data storage:
```rust
pub enum Storage {
    Cpu(Vec<f32>), // Direct vector storage
    // ...
}
```
This provides:
- Memory Layout: Contiguous, row-major (C-style) layout (see the index sketch after this list)
- Access: Direct memory access without marshaling overhead
- Debugging: Easy inspection with standard Rust tools
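As a quick illustration of row-major layout, the flat offset of element `(row, col)` is `row * cols + col`. The following standalone sketch demonstrates the calculation; it is a generic example, not Tensor Frame's internal API:

```rust
// Row-major (C-style) indexing: elements of a row are adjacent in memory.
fn flat_index(row: usize, col: usize, cols: usize) -> usize {
    row * cols + col
}

fn main() {
    // A 2x3 tensor stored as a flat Vec<f32>:
    // [ 1 2 3 ]
    // [ 4 5 6 ]
    let data = vec![1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0];

    // Element (1, 2) is the last element of the buffer
    assert_eq!(data[flat_index(1, 2, 3)], 6.0);
}
```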
### Parallelization
The CPU backend uses Rayon for data-parallel operations:
```rust
use rayon::prelude::*;

// Element-wise operations are parallelized across the thread pool
let result: Vec<f32> = a
    .par_iter()
    .zip(b.par_iter())
    .map(|(a, b)| a + b)
    .collect();
```
Thread Pool: Rayon automatically manages a global thread pool sized to the number of CPU cores.
Granularity: Operations are chunked automatically so that per-task overhead stays small relative to useful work.
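As a rough sketch of what chunked data parallelism looks like with Rayon (using plain slices; Tensor Frame's internal chunking strategy may differ):

```rust
use rayon::prelude::*;

fn main() {
    let a = vec![1.0f32; 1_000_000];
    let b = vec![2.0f32; 1_000_000];
    let mut out = vec![0.0f32; 1_000_000];

    // Split the buffers into fixed-size chunks and process chunks in parallel
    out.par_chunks_mut(4096)
        .zip(a.par_chunks(4096).zip(b.par_chunks(4096)))
        .for_each(|(o, (a, b))| {
            for i in 0..o.len() {
                o[i] = a[i] + b[i];
            }
        });

    assert!(out.iter().all(|&x| x == 3.0));
}
```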
## Performance Characteristics
### Strengths
- Low Latency: Minimal overhead for small operations
- Predictable: Performance scales roughly linearly with data size and core count
- Memory Bandwidth: Efficiently utilizes system memory bandwidth
- Cache Friendly: Good locality for sequential operations
### Limitations
- Compute Bound: Limited by CPU ALU throughput
- Memory Bound: Large operations limited by RAM bandwidth
- Thread Overhead: Parallel overhead dominates for small tensors (see the timing sketch after this list)
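The thread-overhead point is easy to demonstrate: timing a sequential and a parallel sum over the same small buffer usually shows the parallel version losing. This is a plain-`Vec` illustration, not a statement about Tensor Frame's dispatch heuristics:

```rust
use rayon::prelude::*;
use std::time::Instant;

fn main() {
    // Small enough that thread coordination costs dominate the work
    let small = vec![1.0f32; 1_000];

    let start = Instant::now();
    let seq: f32 = small.iter().sum();
    println!("sequential: {:?} (sum = {seq})", start.elapsed());

    let start = Instant::now();
    let par: f32 = small.par_iter().sum();
    println!("parallel:   {:?} (sum = {par})", start.elapsed());
}
```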
## Performance Guidelines
### Optimal Use Cases
```rust
// Small to medium tensors (< 10K elements)
let small = Tensor::ones(vec![100, 100])?;

// Scalar reductions
let sum = large_tensor.sum(None)?;

// Development and prototyping
let test_tensor = Tensor::from_vec(test_data, shape)?;
```
### Suboptimal Use Cases
```rust
// Very large tensor operations
let huge_op = a + b; // Consider GPU for very large tensors

// Repeated large element-wise operations
for _ in 0..1000 {
    result = (a.clone() * b.clone())?; // GPU would be faster
}
```
## Memory Management
### Allocation
CPU tensors allocate memory directly from the system heap:
```rust
let tensor = Tensor::zeros(vec![1000, 1000])?; // Allocates 4 MB (1M f32 values)
```
### Reference Counting
Tensors use `Arc<Vec<f32>>` internally for efficient cloning:
```rust
let tensor1 = Tensor::ones(vec![1000])?;
let tensor2 = tensor1.clone(); // O(1) reference count increment

// Memory is shared until one tensor is modified (copy-on-write semantics)
```
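The same sharing behavior can be observed with std's `Arc` directly; this standalone snippet shows why cloning is O(1):

```rust
use std::sync::Arc;

fn main() {
    let data: Arc<Vec<f32>> = Arc::new(vec![1.0; 1_000]);

    // Cloning copies only the pointer and bumps the reference count
    let shared = Arc::clone(&data);
    assert_eq!(Arc::strong_count(&data), 2);

    // Both handles point at the same allocation
    assert!(std::ptr::eq(&*data, &*shared));
}
```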
### Memory Usage
Monitor memory usage with standard system tools:
```bash
# Linux
cat /proc/meminfo

# macOS
vm_stat

# Windows
wmic OS get TotalVisibleMemorySize,FreePhysicalMemory
```
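A tensor's data footprint can also be estimated in code from its shape (a back-of-the-envelope sketch that ignores allocator and metadata overhead):

```rust
use std::mem::size_of;

/// Rough data-buffer size in bytes for an f32 tensor with the given shape.
fn tensor_bytes(dims: &[usize]) -> usize {
    dims.iter().product::<usize>() * size_of::<f32>()
}

fn main() {
    // A 1000x1000 f32 tensor holds ~4 MB of data
    assert_eq!(tensor_bytes(&[1000, 1000]), 4_000_000);
}
```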
## Debugging and Profiling
### Tensor Inspection
CPU tensors are easy to inspect:
```rust
let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;

// Direct access to underlying data
let data = tensor.to_vec()?;
println!("Raw data: {:?}", data);

// Shape information
println!("Shape: {:?}", tensor.shape().dims());
println!("Elements: {}", tensor.numel());
```
### Performance Profiling
Use standard Rust profiling tools:
```rust
use std::time::Instant;

// Add timing around an operation
let start = Instant::now();
let result = large_tensor.sum(None)?;
println!("CPU operation took: {:?}", start.elapsed());
```
For detailed profiling:
```bash
# Install flamegraph
cargo install flamegraph

# Profile your application
cargo flamegraph --bin your_app
```
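For statistically rigorous measurements, the Criterion crate is a common choice. A minimal benchmark might look like the following, using a plain `Vec` sum as a stand-in for a tensor reduction (place it under `benches/` and set `harness = false` for that bench target in `Cargo.toml`):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_sum(c: &mut Criterion) {
    let data = vec![1.0f32; 1_000_000];
    c.bench_function("sum_1m_f32", |b| {
        b.iter(|| black_box(&data).iter().sum::<f32>())
    });
}

criterion_group!(benches, bench_sum);
criterion_main!(benches);
```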
### Thread Analysis
Monitor Rayon thread usage:
```rust
// Check thread pool size
println!("Rayon threads: {}", rayon::current_num_threads());

// Custom thread pool
let pool = rayon::ThreadPoolBuilder::new()
    .num_threads(4)
    .build()?;

pool.install(|| {
    // Operations here use at most 4 threads
    let result = tensor1 + tensor2;
});
```
## Error Handling
CPU backend errors are typically related to memory allocation:
```rust
use tensor_frame::{Tensor, TensorError};

match Tensor::zeros(vec![100000, 100000]) {
    Ok(tensor) => {
        // Success - 40 GB allocated
    }
    Err(TensorError::BackendError(msg)) => {
        // Likely out of memory
        eprintln!("CPU backend error: {}", msg);
    }
    Err(e) => {
        eprintln!("Other error: {}", e);
    }
}
```
Common Error Conditions:
- Out of Memory: Requesting more memory than available
- Integer Overflow: Tensor dimensions too large for the address space (see the checked computation after this list)
- Thread Panic: Rayon worker thread panics (rare)
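The overflow case can be guarded against by computing the element count with checked arithmetic. This is a defensive sketch, independent of whatever validation Tensor Frame performs internally:

```rust
/// Multiply dimensions with overflow checking; `None` means the shape
/// cannot be addressed on this platform.
fn checked_numel(dims: &[usize]) -> Option<usize> {
    dims.iter().try_fold(1usize, |acc, &d| acc.checked_mul(d))
}

fn main() {
    assert_eq!(checked_numel(&[1000, 1000]), Some(1_000_000));

    // usize::MAX * 2 overflows, so the shape is rejected
    assert_eq!(checked_numel(&[usize::MAX, 2]), None);
}
```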
## Optimization Tips
### Memory Layout Optimization
```rust
// Prefer contiguous operations
let result = (a + b) * c; // Better than separate operations

// Avoid unnecessary allocations
let result = a.clone() + b; // Creates a temporary clone
let result = &a + &b;       // Better - uses references
```
### Parallel Operation Tuning
```rust
// For very small tensors, disable parallelism
let small_result = small_a + small_b; // Rayon decides automatically

// For custom control
rayon::ThreadPoolBuilder::new()
    .num_threads(1) // Force single-threaded
    .build_global()?;
```
### Cache Optimization
```rust
// Process data in blocks for better cache usage
for chunk in tensor.chunks(cache_friendly_size) {
    // Process chunk
}

// Cache-friendly transpose
let transposed = matrix.transpose()?; // May benefit from blocking
```
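Blocking can be written out explicitly for a row-major buffer: transpose tile by tile so each tile stays resident in cache. This is a generic sketch, not Tensor Frame's actual transpose implementation:

```rust
const BLOCK: usize = 64;

/// Transpose a rows x cols row-major matrix into dst (cols x rows).
fn transpose_blocked(src: &[f32], dst: &mut [f32], rows: usize, cols: usize) {
    for rb in (0..rows).step_by(BLOCK) {
        for cb in (0..cols).step_by(BLOCK) {
            // Each BLOCK x BLOCK tile fits comfortably in L1/L2 cache
            for r in rb..(rb + BLOCK).min(rows) {
                for c in cb..(cb + BLOCK).min(cols) {
                    dst[c * rows + r] = src[r * cols + c];
                }
            }
        }
    }
}

fn main() {
    let src: Vec<f32> = (0..6).map(|x| x as f32).collect(); // 2x3
    let mut dst = vec![0.0f32; 6];                          // 3x2
    transpose_blocked(&src, &mut dst, 2, 3);
    assert_eq!(dst, vec![0.0, 3.0, 1.0, 4.0, 2.0, 5.0]);
}
```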
## Integration with Other Libraries
### NumPy Compatibility
```rust
// Convert to/from Vec for NumPy interop
let tensor = Tensor::from_vec(numpy_data, shape)?;
let back_to_numpy = tensor.to_vec()?;
```
### ndarray Integration
```rust
use ndarray::Array2;

// Convert from ndarray
let nd_array = Array2::from_shape_vec((2, 2), vec![1.0, 2.0, 3.0, 4.0])?;
let tensor = Tensor::from_vec(nd_array.into_raw_vec(), vec![2, 2])?;

// Convert to ndarray
let data = tensor.to_vec()?;
let shape = tensor.shape().dims();
let nd_array = Array2::from_shape_vec((shape[0], shape[1]), data)?;
```
### BLAS Integration
For maximum performance, consider linking with optimized BLAS:
```toml
[dependencies]
tensor_frame = "0.0.1-alpha"
blas-src = { version = "0.8", features = ["openblas"] }
```
This can significantly speed up matrix operations on the CPU backend.