Performance Guide
This guide provides detailed information on optimizing Tensor Frame performance across different backends and use cases.
Performance Overview
Tensor Frame's performance characteristics vary significantly based on:
- Tensor size: Small vs large tensors have different optimal backends
- Operation type: Element-wise vs reductions vs matrix operations
- Backend selection: CPU vs WGPU vs CUDA performance profiles
- Memory patterns: Data locality and transfer overhead
Backend Performance Characteristics
CPU Backend
- Best for: Small tensors (< 10K elements), development, guaranteed availability
- Strengths: Low latency, no setup overhead, excellent debugging
- Limitations: Limited parallelism, memory bandwidth bound for large operations
```rust
use tensor_frame::Tensor;

// CPU optimal: small tensors and scalar operations
let small = Tensor::ones(vec![100, 100])?;
let result = small.sum(None)?; // ~0.1ms on a modern CPU
```
WGPU Backend
- Best for: Large element-wise operations (> 100K elements), cross-platform deployment
- Strengths: Massive parallelism, good memory bandwidth, portable
- Limitations: GPU setup overhead (~1-10ms), limited operation support
```rust
use tensor_frame::{Tensor, BackendType};

// WGPU optimal: large parallel operations
let large_a = Tensor::ones(vec![2048, 2048])?.to_backend(BackendType::Wgpu)?;
let large_b = Tensor::ones(vec![2048, 2048])?.to_backend(BackendType::Wgpu)?;
let large_c = Tensor::ones(vec![2048, 2048])?.to_backend(BackendType::Wgpu)?;
let result = (large_a * large_b) + large_c; // ~2ms on a modern GPU
```
CUDA Backend
- Best for: Very large operations (> 1M elements), production workloads
- Strengths: Peak performance, mature optimizations, cuBLAS integration
- Limitations: NVIDIA-only, CUDA toolkit requirement
```rust
use tensor_frame::{Tensor, BackendType};

// CUDA optimal: matrix operations and very large tensors
let matrix_a = Tensor::ones(vec![4096, 4096])?.to_backend(BackendType::Cuda)?;
let matrix_b = Tensor::ones(vec![4096, 4096])?.to_backend(BackendType::Cuda)?;
let result = matrix_a.matmul(&matrix_b)?; // ~15ms with cuBLAS
```
Operation-Specific Performance
Element-wise Operations
Performance Scaling:
- CPU: O(n) with thread-level parallelism (8-32 threads)
- WGPU: O(n) with massive parallelism (1000+ threads)
- CUDA: O(n) with optimal parallelism (10000+ threads)
```rust
use std::time::Instant;
use tensor_frame::{Tensor, BackendType, Result};

fn benchmark_element_wise() -> Result<()> {
    let sizes = vec![1000, 5000, 10000];

    for size in sizes {
        let a = Tensor::ones(vec![size, size])?;
        let b = Tensor::ones(vec![size, size])?;

        // CPU timing
        let start = Instant::now();
        let _cpu_result = &a + &b;
        let cpu_time = start.elapsed();

        // GPU timing (if available)
        #[cfg(feature = "wgpu")]
        {
            let gpu_a = a.to_backend(BackendType::Wgpu)?;
            let gpu_b = b.to_backend(BackendType::Wgpu)?;

            let start = Instant::now();
            let gpu_result = &gpu_a + &gpu_b;
            let _sync = gpu_result.to_vec()?; // Force synchronization
            let gpu_time = start.elapsed();

            let speedup = cpu_time.as_nanos() as f64 / gpu_time.as_nanos() as f64;
            println!(
                "Size {}x{}: CPU {:?}, GPU {:?}, Speedup: {:.1}x",
                size, size, cpu_time, gpu_time, speedup
            );
        }
    }
    Ok(())
}
```
Reduction Operations
Performance Notes:
- CPU: Rayon parallel reduction, cache-efficient
- GPU: Requires multiple kernel launches for large reductions
- Memory-bound for large tensors
```rust
use std::time::Instant;
use tensor_frame::{Tensor, Result};

fn reduction_performance() -> Result<()> {
    let tensor = Tensor::ones(vec![10000, 10000])?; // 100M elements

    // Sum reduction timing
    let start = Instant::now();
    let sum = tensor.sum(None)?;
    let cpu_time = start.elapsed();

    println!("CPU sum reduction (100M elements): {:?}", cpu_time);
    println!("Result: {}", sum.to_vec()?[0]);
    Ok(())
}
```
Memory Performance
Memory Transfer Costs
GPU operations include memory transfer overhead:
```rust
use std::time::Instant;
use tensor_frame::{Tensor, BackendType, Result};

fn memory_transfer_analysis() -> Result<()> {
    let sizes = vec![1000, 5000, 10000];

    for size in sizes {
        let tensor = Tensor::ones(vec![size, size])?;
        let elements = tensor.numel();
        let bytes = elements * 4; // f32 = 4 bytes

        #[cfg(feature = "wgpu")]
        {
            // Time the upload to the GPU
            let start = Instant::now();
            let gpu_tensor = tensor.to_backend(BackendType::Wgpu)?;
            let upload_time = start.elapsed();

            // Time the download back to the CPU
            let start = Instant::now();
            let _data = gpu_tensor.to_vec()?;
            let download_time = start.elapsed();

            let upload_bw = bytes as f64 / upload_time.as_secs_f64() / 1e9;   // GB/s
            let download_bw = bytes as f64 / download_time.as_secs_f64() / 1e9; // GB/s

            println!("Size {}x{} ({} MB):", size, size, bytes / 1024 / 1024);
            println!("  Upload:   {:?} ({:.1} GB/s)", upload_time, upload_bw);
            println!("  Download: {:?} ({:.1} GB/s)", download_time, download_bw);
        }
    }
    Ok(())
}
```
Memory Layout Optimization
```rust
// Efficient: contiguous memory access
let matrix = Tensor::from_vec(data, vec![rows, cols])?;
let transposed = matrix.transpose()?; // May require a memory copy

// Efficient: operations that preserve layout
let result = (&matrix_a + &matrix_b) * 2.0; // Element-wise ops maintain layout

// Less efficient: operations that break layout
let reshaped = matrix.reshape(vec![cols, rows])?; // May require a copy
```
Optimization Strategies
1. Backend Selection Strategy
```rust
use tensor_frame::BackendType;

fn optimal_backend_for_workload(tensor_size: usize, operation: &str) -> BackendType {
    match (tensor_size, operation) {
        // Small tensors: CPU is always optimal
        (0..=10_000, _) => BackendType::Cpu,

        // Large reductions: prefer CUDA
        (_, "reduction") if tensor_size > 1_000_000 => {
            #[cfg(feature = "cuda")]
            { BackendType::Cuda }
            #[cfg(not(feature = "cuda"))]
            { BackendType::Cpu }
        }

        // Large element-wise: GPU is beneficial
        (10_001..=1_000_000, "elementwise") => {
            #[cfg(feature = "wgpu")]
            { BackendType::Wgpu }
            #[cfg(not(feature = "wgpu"))]
            { BackendType::Cpu }
        }

        // Very large: prefer CUDA > WGPU > CPU
        (1_000_001.., _) => {
            #[cfg(feature = "cuda")]
            { BackendType::Cuda }
            #[cfg(all(feature = "wgpu", not(feature = "cuda")))]
            { BackendType::Wgpu }
            #[cfg(all(not(feature = "wgpu"), not(feature = "cuda")))]
            { BackendType::Cpu }
        }

        // Default: CPU
        _ => BackendType::Cpu,
    }
}
```
2. Operation Fusion
```rust
// Efficient: fused expression (single pass, potential fusion)
let result = ((a * b) + c) / d;

// Less efficient: separate statements with multiple temporary allocations
let temp1 = a * b;
let temp2 = temp1 + c;
let result = temp2 / d;
```
3. Batch Processing
```rust
use tensor_frame::{Tensor, BackendType, Result};

fn efficient_batch_processing(batches: Vec<Tensor>) -> Result<Vec<Tensor>> {
    // Convert every batch to the same backend once
    let backend = BackendType::Wgpu;
    let gpu_batches: Result<Vec<_>> = batches
        .into_iter()
        .map(|t| t.to_backend(backend))
        .collect();

    // Heavy computation stays on the GPU
    Ok(gpu_batches?
        .into_iter()
        .map(|batch| (batch * 2.0) + 1.0)
        .collect())
}
```
4. Memory Pool Usage
```rust
use std::collections::HashMap;
use tensor_frame::{Tensor, Result};

// Efficient: reuse similarly shaped tensors instead of reallocating
struct TensorPool {
    cached_tensors: HashMap<Vec<usize>, Vec<Tensor>>,
}

impl TensorPool {
    fn get_or_create(&mut self, shape: Vec<usize>) -> Result<Tensor> {
        if let Some(cached) = self.cached_tensors.get_mut(&shape) {
            if let Some(tensor) = cached.pop() {
                return Ok(tensor);
            }
        }
        // Create a new tensor if no cached one matches
        Tensor::zeros(shape)
    }

    fn return_tensor(&mut self, tensor: Tensor) {
        let shape = tensor.shape().dims().to_vec();
        self.cached_tensors
            .entry(shape)
            .or_insert_with(Vec::new)
            .push(tensor);
    }
}
```
Profiling and Debugging
CPU Profiling
```rust
use std::time::Instant;

// Built-in timing
let start = Instant::now();
let result = expensive_operation()?;
println!("Operation took: {:?}", start.elapsed());

// External profilers:
//   cargo install flamegraph
//   cargo flamegraph --bin your_app
```
GPU Profiling
NVIDIA Tools (for CUDA backend):
```bash
# Nsight Systems for timeline analysis
nsys profile --stats=true ./your_app

# Nsight Compute for kernel analysis
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./your_app
```
Platform Tools (for WGPU backend):
- Windows: PIX for Windows, RenderDoc
- macOS: Xcode Instruments (GPU Timeline)
- Linux: RenderDoc, Vulkan Tools
Memory Profiling
```rust
use tensor_frame::Result;

fn memory_usage_analysis() -> Result<()> {
    // Monitor system memory usage (resident set size)
    #[cfg(target_os = "linux")]
    {
        use std::fs;
        let status = fs::read_to_string("/proc/self/status")?;
        for line in status.lines() {
            if line.starts_with("VmRSS:") {
                println!("Memory usage: {}", line);
            }
        }
    }

    // GPU memory monitoring (platform-specific; exact API depends on the CUDA backend)
    #[cfg(feature = "cuda")]
    {
        let (free, total) = cuda::memory_info()?;
        println!(
            "GPU memory: {} MB free of {} MB total",
            free / 1024 / 1024,
            total / 1024 / 1024
        );
    }

    Ok(())
}
```
Performance Benchmarking
Comprehensive Benchmark Suite
```rust
use criterion::{criterion_group, criterion_main, Criterion};
use tensor_frame::{Tensor, BackendType};

fn bench_tensor_operations(c: &mut Criterion) {
    let sizes = vec![100, 500, 1000, 2000];

    for size in sizes {
        let a = Tensor::ones(vec![size, size]).unwrap();
        let b = Tensor::ones(vec![size, size]).unwrap();

        // CPU benchmark
        c.bench_function(&format!("cpu_add_{}x{}", size, size), |bench| {
            bench.iter(|| {
                let _result = &a + &b;
            });
        });

        // GPU benchmark (if available)
        #[cfg(feature = "wgpu")]
        {
            let gpu_a = a.to_backend(BackendType::Wgpu).unwrap();
            let gpu_b = b.to_backend(BackendType::Wgpu).unwrap();

            c.bench_function(&format!("gpu_add_{}x{}", size, size), |bench| {
                bench.iter(|| {
                    let result = &gpu_a + &gpu_b;
                    let _sync = result.to_vec().unwrap(); // Force synchronization
                });
            });
        }
    }
}

criterion_group!(benches, bench_tensor_operations);
criterion_main!(benches);
```
Performance Troubleshooting
Common Performance Issues
- Small Tensors on GPU
```rust
use tensor_frame::{Tensor, BackendType};

// Problem: GPU overhead dwarfs the computation for tiny tensors
let small = Tensor::ones(vec![10, 10])?;
let slow = small.to_backend(BackendType::Wgpu)?; // Transfer overhead > computation

// Solution: skip the conversion and keep small tensors on the CPU
let fast = small; // Stay on CPU
```
- Frequent Backend Conversions
```rust
// Problem: repeated conversions on every iteration
for _ in 0..1000 {
    let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?;
    let result = gpu_tensor + 1.0;
    let _back_to_cpu = result.to_backend(BackendType::Cpu)?;
}

// Solution: convert once and stay on the GPU
let mut gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?;
for _ in 0..1000 {
    gpu_tensor = gpu_tensor + 1.0; // Stays on the GPU
}
let final_result = gpu_tensor.to_backend(BackendType::Cpu)?;
```
- Memory Fragmentation
```rust
// Problem: large temporary allocations
let huge_temp = (huge_a * huge_b) + huge_c; // Three large tensors live in memory at once

// Solution: in-place operations (when available)
let result = huge_a.mul_add(&huge_b, &huge_c)?; // Hypothetical in-place op
```
Performance Debugging Checklist
- Profile first: Measure before optimizing (a minimal timing sketch follows this checklist)
- Check backend selection: Ensure optimal backend for workload
- Monitor memory transfers: GPU transfer costs often dominate
- Verify operation fusion: Combine operations when possible
- Consider batch size: Larger batches amortize overhead
- Test different tensor sizes: Performance characteristics vary by size
- Use appropriate data types: f32 is typically much faster than f64, especially on GPUs
- Monitor memory usage: Avoid memory pressure and swapping
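To support the "profile first" item, here is a minimal sketch of a reusable timing helper built only on `std::time`; the measured closure is a placeholder for a real tensor operation, not a Tensor Frame API.

```rust
use std::time::Instant;

/// Run `f`, print how long it took under `label`, and return its result.
fn time_it<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    println!("{}: {:?}", label, start.elapsed());
    out
}

fn main() {
    // `heavy_op` stands in for whatever operation you want to measure.
    let heavy_op = || (0..1_000_000u64).sum::<u64>();
    let total = time_it("sum of 0..1_000_000", heavy_op);
    println!("result = {}", total);
}
```

Wrapping operations this way keeps ad-hoc measurements consistent before reaching for external profilers.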
Hardware-Specific Optimization
CPU Optimization
- Use all available cores (Rayon handles this automatically; see the thread-pool sketch after this list)
- Ensure sufficient memory bandwidth
- Consider NUMA topology for large systems
- Link with optimized BLAS (OpenBLAS, Intel MKL)
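Because the CPU backend's parallelism comes from Rayon, the worker count can be tuned through Rayon's global thread pool. The following is a sketch of plain Rayon usage under the assumption that Tensor Frame dispatches onto the global pool; it is not a Tensor Frame API.

```rust
use rayon::ThreadPoolBuilder;

fn main() {
    // Cap the global Rayon pool at 8 worker threads before any parallel work
    // runs. Assumption: Tensor Frame's CPU backend uses the global pool.
    ThreadPoolBuilder::new()
        .num_threads(8)
        .build_global()
        .expect("global Rayon pool was already initialized");

    println!("Rayon worker threads: {}", rayon::current_num_threads());
}
```

Setting the `RAYON_NUM_THREADS` environment variable achieves the same effect without code changes.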
GPU Optimization
- Ensure sufficient GPU memory
- Consider tensor sizes that align with GPU architecture (see the padding sketch after this list)
- Use appropriate batch sizes for GPU utilization
- Monitor thermal throttling on mobile/laptop GPUs
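As a rough illustration of the size-alignment point above, the helper below pads each dimension up to a multiple of a chosen alignment (256 here, a common workgroup size); the alignment value is an assumption for illustration, not a Tensor Frame requirement.

```rust
/// Round `dim` up to the next multiple of `align` (`align` must be non-zero).
fn round_up(dim: usize, align: usize) -> usize {
    ((dim + align - 1) / align) * align
}

/// Pad every dimension of a shape so GPU workgroups divide it evenly.
fn pad_shape(shape: &[usize], align: usize) -> Vec<usize> {
    shape.iter().map(|&d| round_up(d, align)).collect()
}

fn main() {
    // 1000 is not a multiple of 256, so both dimensions pad up to 1024.
    let padded = pad_shape(&[1000, 1000], 256);
    assert_eq!(padded, vec![1024, 1024]);
    println!("padded shape: {:?}", padded);
}
```

Whether padding actually pays off depends on the backend and kernel; measure before adopting it.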
Memory Hierarchy
- L1/L2 cache: Small, frequently accessed tensors (a sizing sketch follows this list)
- System RAM: Medium tensors and CPU operations
- GPU VRAM: Large tensors for GPU operations
- Storage: Streaming large datasets
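To relate concrete shapes to these tiers, a small helper can estimate the f32 footprint of a tensor (4 bytes per element, as noted earlier); the thresholds below are illustrative assumptions, not measured hardware limits.

```rust
/// Approximate memory footprint in bytes of an f32 tensor with this shape.
fn f32_bytes(shape: &[usize]) -> usize {
    shape.iter().product::<usize>() * std::mem::size_of::<f32>()
}

fn main() {
    for shape in [vec![64, 64], vec![4096, 4096], vec![32768, 32768]] {
        let mb = f32_bytes(&shape) as f64 / (1024.0 * 1024.0);
        // Illustrative thresholds only; real limits depend on the hardware.
        let tier = if mb < 1.0 {
            "small enough to stay CPU-cache friendly"
        } else if mb < 1024.0 {
            "system RAM or GPU VRAM territory"
        } else {
            "consider streaming from storage"
        };
        println!("{:?} ~ {:.1} MB -> {}", shape, mb, tier);
    }
}
```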
Conclusion
Tensor Frame performance optimization requires understanding:
- Workload characteristics: Size, operations, access patterns
- Backend strengths: CPU for small/mixed, GPU for large parallel
- Memory costs: Transfer overhead, allocation patterns
- Platform specifics: Hardware capabilities and limitations
Use profiling tools to guide optimization decisions and always measure performance improvements to ensure they provide real benefits for your specific use case.