Tensor Frame
Tensor Frame is a high-performance, PyTorch-like tensor library for Rust that supports multiple computational backends including CPU (with Rayon), WGPU (for GPU compute), and CUDA.
Features
- Multiple Backends: Automatic backend selection with fallback support
- CPU backend with Rayon for parallel processing
- WGPU backend for cross-platform GPU computing
- CUDA backend for NVIDIA GPU acceleration
- PyTorch-like API: Familiar tensor operations and broadcasting
- Dynamic Tensors: Runtime shape and type flexibility
- Broadcasting Support: Automatic shape broadcasting for operations
- Zero-Copy Operations: Efficient memory management where possible
- Feature Flags: Optional backends via Cargo features
Quick Example
use tensor_frame::Tensor;

// Create tensors (automatically uses the best available backend)
let a = Tensor::ones(vec![2, 3])?;
let b = Tensor::zeros(vec![2, 3])?;

// Perform operations with automatic broadcasting
let c = (a + b)?;
let d = c.sum(None)?; // Sum all elements

// Convert back to Vec for inspection
let result = d.to_vec()?;
println!("Result: {:?}", result);
Backend Priority
By default, Tensor Frame will attempt to use backends in this order:
- CUDA (if available and feature enabled)
- WGPU (if available and feature enabled)
- CPU (always available)
You can also explicitly specify a backend or create custom backend implementations.
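For instance, once a priority list is set (using the set_backend_priority and backend_type APIs documented later in this book), you can confirm which backend a tensor actually landed on. A minimal sketch, assuming the CPU-only priority shown here:

use tensor_frame::Tensor;
use tensor_frame::backend::{set_backend_priority, BackendType};

// Restrict selection to the CPU backend only
let _backend = set_backend_priority(vec![BackendType::Cpu]);

// Tensors created afterwards should report the CPU backend
let t = Tensor::zeros(vec![4, 4])?;
println!("Selected backend: {:?}", t.backend_type());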
Getting Started
Installation
Add Tensor Frame to your Cargo.toml:
[dependencies]
tensor_frame = "0.0.1-alpha"
Feature Flags
Tensor Frame supports optional backends via feature flags:
[dependencies]
# CPU only (default)
tensor_frame = "0.0.1-alpha"
# With WGPU support
tensor_frame = { version = "0.0.1-alpha", features = ["wgpu"] }
# With CUDA support
tensor_frame = { version = "0.0.1-alpha", features = ["cuda"] }
# All backends
tensor_frame = { version = "0.0.1-alpha", features = ["wgpu", "cuda"] }
Basic Usage
Creating Tensors
use tensor_frame::{Tensor, Result};

fn main() -> Result<()> {
    // Create tensors with different initialization
    let zeros = Tensor::zeros(vec![2, 3])?;
    let ones = Tensor::ones(vec![2, 3])?;
    let from_data = Tensor::from_vec(
        vec![1.0, 2.0, 3.0, 4.0],
        vec![2, 2]
    )?;

    // Inspect tensor properties
    println!("Shape: {:?}", zeros.shape().dims());
    println!("Number of elements: {}", zeros.numel());
    println!("Number of dimensions: {}", zeros.ndim());

    Ok(())
}
Basic Operations
use tensor_frame::{Tensor, Result};

fn main() -> Result<()> {
    let a = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
    let b = Tensor::from_vec(vec![5.0, 6.0, 7.0, 8.0], vec![2, 2])?;

    // Element-wise operations
    let sum = (a.clone() + b.clone())?;
    let diff = (a.clone() - b.clone())?;
    let product = (a.clone() * b.clone())?;
    let quotient = (a / b)?;

    // Reduction operations
    let total = sum.sum(None)?;
    let average = product.mean(None)?;

    println!("Sum result: {:?}", total.to_vec()?);
    Ok(())
}
Broadcasting
Tensor Frame supports automatic broadcasting similar to NumPy and PyTorch:
use tensor_frame::{Tensor, Result};

fn main() -> Result<()> {
    let a = Tensor::ones(vec![2, 1])?; // Shape: [2, 1]
    let b = Tensor::ones(vec![1, 3])?; // Shape: [1, 3]

    // Broadcasting: [2, 1] + [1, 3] -> [2, 3]
    let c = (a + b)?;
    println!("Result shape: {:?}", c.shape().dims());

    Ok(())
}
Tensor Manipulation
use tensor_frame::{Tensor, Result, TensorOps};

fn main() -> Result<()> {
    let tensor = Tensor::from_vec(
        vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        vec![2, 3]
    )?;

    // Reshape
    let reshaped = tensor.reshape(vec![3, 2])?;

    // Transpose (2D only for now)
    let transposed = reshaped.transpose()?;

    // Squeeze and unsqueeze
    let squeezed = tensor.squeeze(None)?;
    let unsqueezed = squeezed.unsqueeze(0)?;

    Ok(())
}
API Reference
This section provides detailed documentation for all public APIs in Tensor Frame.
Core Types
- Tensor - The main tensor type with all operations
- Backends - Backend trait and implementation details
- Operations - Detailed operation specifications
Key Traits and Enums
TensorOps Trait
The TensorOps trait defines all tensor manipulation and computation operations:
pub trait TensorOps {
    fn reshape(&self, new_shape: Vec<usize>) -> Result<Tensor>;
    fn transpose(&self) -> Result<Tensor>;
    fn squeeze(&self, dim: Option<usize>) -> Result<Tensor>;
    fn unsqueeze(&self, dim: usize) -> Result<Tensor>;
    // ... more methods
}
DType Enum
Supported data types:
pub enum DType {
    F32, // 32-bit floating point (default)
    F64, // 64-bit floating point
    I32, // 32-bit signed integer
    U32, // 32-bit unsigned integer
}
BackendType Enum
Available computational backends:
pub enum BackendType {
    Cpu,  // CPU backend with Rayon
    Wgpu, // Cross-platform GPU backend
    Cuda, // NVIDIA CUDA backend
}
Error Handling
All operations return Result<T> with TensorError for comprehensive error handling:
pub enum TensorError {
    ShapeMismatch { expected: Vec<usize>, got: Vec<usize> },
    BackendError(String),
    InvalidOperation(String),
    DimensionError(String),
}
Memory Management
Tensor Frame uses smart pointers and reference counting for efficient memory management:
- Tensors are cheaply clonable (reference counted)
- Backend storage is automatically managed
- Cross-backend tensor conversion is supported
- Zero-copy operations where possible
Tensor API
The Tensor struct is the core data structure in Tensor Frame, representing multi-dimensional arrays with automatic backend selection.
Constructor Methods
Basic Constructors
// Create tensor filled with zeros
pub fn zeros(shape: Vec<usize>) -> Result<Tensor>

// Create tensor filled with ones
pub fn ones(shape: Vec<usize>) -> Result<Tensor>

// Create tensor from Vec data
pub fn from_vec(data: Vec<f32>, shape: Vec<usize>) -> Result<Tensor>
Examples
use tensor_frame::Tensor;

// 2x3 matrix of zeros
let zeros = Tensor::zeros(vec![2, 3])?;

// 1D vector of ones
let ones = Tensor::ones(vec![5])?;

// Create from existing data
let data = vec![1.0, 2.0, 3.0, 4.0];
let tensor = Tensor::from_vec(data, vec![2, 2])?;
Properties
Shape Information
// Get tensor shape
pub fn shape(&self) -> &Shape

// Get number of elements
pub fn numel(&self) -> usize

// Get number of dimensions
pub fn ndim(&self) -> usize
Data Access
// Convert tensor to Vec<f32>
pub fn to_vec(&self) -> Result<Vec<f32>>
Arithmetic Operations
Tensor Frame supports standard arithmetic operations through operator overloading:
Binary Operations
// Addition (element-wise)
let c = a + b;
let c = &a + &b; // Avoid cloning

// Subtraction (element-wise)
let c = a - b;

// Multiplication (element-wise)
let c = a * b;

// Division (element-wise)
let c = a / b;
Broadcasting Rules
Addition automatically broadcasts tensors following NumPy/PyTorch rules. Note: broadcasting is currently implemented only for addition; the other element-wise operations require matching shapes.
- Dimensions are aligned from the right
- Missing dimensions are treated as size 1
- Dimensions of size 1 are expanded to match
let a = Tensor::ones(vec![2, 1, 3])?; // Shape: [2, 1, 3]
let b = Tensor::ones(vec![1, 4, 1])?; // Shape: [1, 4, 1]
let c = a + b;                        // Result: [2, 4, 3]
Tensor Manipulation
Reshaping
impl TensorOps for Tensor {
    // Change tensor shape (must preserve total elements)
    fn reshape(&self, new_shape: Vec<usize>) -> Result<Tensor>;
}
let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0], vec![2, 3])?;
let reshaped = tensor.reshape(vec![3, 2])?; // 2x3 -> 3x2
Transposition
// Transpose 2D tensor (swap dimensions)
fn transpose(&self) -> Result<Tensor>;
let matrix = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
let transposed = matrix.transpose()?; // [[1, 2], [3, 4]] -> [[1, 3], [2, 4]]
Dimension Manipulation
// Remove dimensions of size 1
fn squeeze(&self, dim: Option<usize>) -> Result<Tensor>;

// Add dimension of size 1
fn unsqueeze(&self, dim: usize) -> Result<Tensor>;
let tensor = Tensor::ones(vec![1, 3, 1])?; // Shape: [1, 3, 1]
let squeezed = tensor.squeeze(None)?;      // Shape: [3]
let unsqueezed = squeezed.unsqueeze(0)?;   // Shape: [1, 3]
Reduction Operations
Full Reductions
// Sum all elements
fn sum(&self, axis: Option<usize>) -> Result<Tensor>;

// Mean of all elements
fn mean(&self, axis: Option<usize>) -> Result<Tensor>;
let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;

// Sum all elements -> scalar tensor
let total = tensor.sum(None)?;    // Result: 10.0

// Mean of all elements -> scalar tensor
let average = tensor.mean(None)?; // Result: 2.5
Axis-specific Reductions
Note: Axis-specific reductions are not yet implemented in the CPU backend. Currently, only full tensor reductions (with axis=None) are supported.
Display and Debug
Tensors implement comprehensive display formatting:
let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
println!("{}", tensor);
// Output:
// Tensor([[1.0000, 2.0000],
//         [3.0000, 4.0000]], dtype=f32)
Type Conversions
// Convert to Vec for external use
let data: Vec<f32> = tensor.to_vec()?;

// Clone (cheap - reference counted)
let cloned = tensor.clone();
Performance Notes
- Cloning: Tensors use reference counting, so cloning is O(1)
- Backend Selection: Operations stay on the same backend when possible
- Memory Layout: Tensors use row-major (C-style) memory layout
- Broadcasting: Zero-copy when possible, falls back to explicit expansion
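A quick illustration of the layout and cloning points, using only the constructors and accessors documented above (a sketch; indices and shapes are arbitrary):

// Row-major layout: for shape [2, 3], element (i, j) sits at flat index i * 3 + j
let t = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0], vec![2, 3])?;
let flat = t.to_vec()?;
let (i, j) = (1, 2);
println!("t[1][2] = {}", flat[i * 3 + j]); // prints 6

// Cloning only bumps a reference count; no element data is copied
let t2 = t.clone();
println!("clone has {} elements", t2.numel());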
Backend System
Tensor Frame uses a pluggable backend system that allows tensors to run on different computational devices. This page documents the backend architecture and API.
Backend Trait
All backends implement the Backend trait:
pub trait Backend: Debug + Send + Sync {
    fn backend_type(&self) -> BackendType;
    fn is_available(&self) -> bool;

    // Tensor creation
    fn zeros(&self, shape: &Shape, dtype: DType) -> Result<Storage>;
    fn ones(&self, shape: &Shape, dtype: DType) -> Result<Storage>;
    fn from_slice(&self, data: &[f32], shape: &Shape) -> Result<Storage>;

    // Arithmetic operations
    fn add(&self, lhs: &Storage, rhs: &Storage) -> Result<Storage>;
    fn sub(&self, lhs: &Storage, rhs: &Storage) -> Result<Storage>;
    fn mul(&self, lhs: &Storage, rhs: &Storage) -> Result<Storage>;
    fn div(&self, lhs: &Storage, rhs: &Storage) -> Result<Storage>;

    // Reduction operations
    fn sum(&self, storage: &Storage, axis: Option<usize>) -> Result<Storage>;
    fn mean(&self, storage: &Storage, axis: Option<usize>) -> Result<Storage>;

    // Data access
    fn to_vec_f32(&self, storage: &Storage) -> Result<Vec<f32>>;
}
Storage Types
Each backend uses a different storage mechanism:
pub enum Storage {
    Cpu(Vec<f32>),     // CPU: simple Vec
    Wgpu(WgpuStorage), // WGPU: GPU buffer
    Cuda(CudaStorage), // CUDA: device pointer
}

pub struct WgpuStorage {
    pub buffer: Arc<wgpu::Buffer>, // WGPU buffer handle
}

pub struct CudaStorage {
    pub ptr: *mut f32, // Raw CUDA device pointer
    pub len: usize,    // Buffer length
}
Backend Selection
Automatic Selection
By default, Tensor Frame automatically selects the best available backend:
- CUDA (if available and feature enabled)
- WGPU (if available and feature enabled)
- CPU (always available)
// Uses automatic backend selection
let tensor = Tensor::zeros(vec![1000, 1000])?;
println!("Selected backend: {:?}", tensor.backend_type());
Manual Selection
You can also explicitly specify backend priority:
use tensor_frame::backend::{set_backend_priority, BackendType};

// Force CPU backend
let cpu_backend = set_backend_priority(vec![BackendType::Cpu]);

// Prefer WGPU over CUDA
let gpu_backend = set_backend_priority(vec![
    BackendType::Wgpu,
    BackendType::Cuda,
    BackendType::Cpu,
]);
Backend Conversion
Convert tensors between backends:
let cpu_tensor = Tensor::ones(vec![100, 100])?;

// Convert to GPU backend (if available)
let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?;

// Convert back to CPU
let back_to_cpu = gpu_tensor.to_backend(BackendType::Cpu)?;
Performance Characteristics
CPU Backend
- Pros: Always available, good for small tensors, excellent for development
- Cons: Limited parallelism, slower for large operations
- Best for: Tensors < 10K elements, prototyping, fallback option
- Implementation: Uses Rayon for parallel CPU operations
WGPU Backend
- Pros: Cross-platform GPU support, works on Metal/Vulkan/DX12/OpenGL
- Cons: Compute shader overhead, limited by GPU memory
- Best for: Large tensor operations, cross-platform deployment
- Implementation: Compute shaders with buffer storage
CUDA Backend
- Pros: Highest performance on NVIDIA GPUs, mature ecosystem
- Cons: NVIDIA-only, requires CUDA toolkit installation
- Best for: Production workloads on NVIDIA hardware
- Implementation: cuBLAS and custom CUDA kernels
Backend Availability
Check backend availability at runtime:
use tensor_frame::backend::{cpu, wgpu, cuda};

// CPU backend is always available
println!("CPU available: {}", cpu::CpuBackend::new().is_available());

// Check GPU backends
#[cfg(feature = "wgpu")]
if let Ok(wgpu_backend) = wgpu::WgpuBackend::new() {
    println!("WGPU available: {}", wgpu_backend.is_available());
}

#[cfg(feature = "cuda")]
println!("CUDA available: {}", cuda::is_available());
Cross-Backend Operations
Operations between tensors on different backends automatically handle conversion:
let cpu_tensor = Tensor::ones(vec![100])?;
let gpu_tensor = Tensor::zeros(vec![100])?.to_backend(BackendType::Wgpu)?;

// Automatically converts gpu_tensor to the CPU backend for the operation
let result = cpu_tensor + gpu_tensor;
Custom Backends
You can implement custom backends by implementing the Backend trait:
#[derive(Debug)]
struct MyCustomBackend;

impl Backend for MyCustomBackend {
    fn backend_type(&self) -> BackendType {
        // Would need to extend the BackendType enum
        BackendType::Custom
    }

    fn is_available(&self) -> bool {
        true // Your availability logic
    }

    // Implement all required methods...
    fn zeros(&self, shape: &Shape, dtype: DType) -> Result<Storage> {
        // Your implementation
    }

    // ... more methods
}
Memory Management
Reference Counting
- Tensors use Arc<dyn Backend> for backend sharing
- Storage is reference counted within each backend
- Automatic cleanup when the last reference is dropped
Cross-Backend Memory
- Converting between backends allocates new memory
- Original data remains valid until all references dropped
- No automatic synchronization between backends
GPU Memory Management
- WGPU backend uses WGPU's automatic memory management
- CUDA backend manually manages device memory with proper cleanup
- Out-of-memory errors are propagated as TensorError::BackendError
Operations Reference
This page provides detailed specifications for all tensor operations in Tensor Frame.
Arithmetic Operations
Element-wise Binary Operations
Note: Currently, only addition supports automatic broadcasting. Other operations require tensors to have matching shapes.
Addition (+)
fn add(lhs: &Tensor, rhs: &Tensor) -> Result<Tensor>
Computes element-wise addition: output[i] = lhs[i] + rhs[i]
Broadcasting: Yes
Supported shapes: Any compatible shapes
Error conditions: Shape incompatibility
let a = Tensor::ones(vec![2, 3])?;
let b = Tensor::ones(vec![2, 3])?;
let c = a + b; // All elements = 2.0
Subtraction (-)
fn sub(lhs: &Tensor, rhs: &Tensor) -> Result<Tensor>
Computes element-wise subtraction: output[i] = lhs[i] - rhs[i]
Broadcasting: No (requires matching shapes)
Supported shapes: Must have identical shapes
Error conditions: Shape mismatch
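A short sketch of the failure mode (hedged: the exact error variant and message formatting are up to the library, but a mismatch like this is expected to surface as an Err rather than a panic):

use tensor_frame::Tensor;

let a = Tensor::ones(vec![2, 3])?;
let b = Tensor::ones(vec![3, 2])?;

// Shapes [2, 3] and [3, 2] differ, so element-wise subtraction returns an error
match a - b {
    Ok(_) => println!("unexpected success"),
    Err(e) => eprintln!("subtraction failed as expected: {}", e),
}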
Multiplication (*)
fn mul(lhs: &Tensor, rhs: &Tensor) -> Result<Tensor>
Computes element-wise multiplication: output[i] = lhs[i] * rhs[i]
Note: This is element-wise multiplication, not matrix multiplication.
Broadcasting: No (requires matching shapes)
Supported shapes: Must have identical shapes
Error conditions: Shape mismatch
Division (/)
fn div(lhs: &Tensor, rhs: &Tensor) -> Result<Tensor>
Computes element-wise division: output[i] = lhs[i] / rhs[i]
Broadcasting: No (requires matching shapes)
Supported shapes: Must have identical shapes
Error conditions: Shape mismatch, division by zero
Reduction Operations
Sum
fn sum(&self, axis: Option<usize>) -> Result<Tensor>
Computes sum along specified axis or all elements.
Parameters:
- axis: None - Sum all elements, returning a scalar tensor
- axis: Some(i) - Sum along axis i, reducing that dimension
Supported shapes: Any
Error conditions: Axis-specific reductions not yet implemented in CPU backend
let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;

// Sum all elements
let total = tensor.sum(None)?; // Result: [10.0] (scalar)

// Axis-specific sums not yet implemented.
// These will be available in future versions:
// let col_sums = tensor.sum(Some(0))?; // Future: [4.0, 6.0]
// let row_sums = tensor.sum(Some(1))?; // Future: [3.0, 7.0]
Mean
fn mean(&self, axis: Option<usize>) -> Result<Tensor>
Computes arithmetic mean along specified axis or all elements.
Parameters:
- axis: None - Mean of all elements, returning a scalar tensor
- axis: Some(i) - Mean along axis i, reducing that dimension
Supported shapes: Any
Error conditions: Axis-specific reductions not yet implemented in CPU backend
let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;

// Mean of all elements
let average = tensor.mean(None)?; // Result: [2.5] (scalar)

// Axis-specific means not yet implemented.
// This will be available in future versions:
// let col_means = tensor.mean(Some(0))?; // Future: [2.0, 3.0]
Shape Manipulation
Reshape
fn reshape(&self, new_shape: Vec<usize>) -> Result<Tensor>
Changes tensor shape while preserving total number of elements.
Requirements:
- The product of new_shape must equal self.numel()
- The new shape cannot have zero dimensions
Error conditions:
- Incompatible total elements
- Invalid shape dimensions
let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0], vec![2, 3])?;
let reshaped = tensor.reshape(vec![3, 2])?; // 2×3 -> 3×2
let flattened = tensor.reshape(vec![6])?;   // 2×3 -> 1-D with 6 elements
Transpose
fn transpose(&self) -> Result<Tensor>
Transposes a 2D tensor (swaps dimensions).
Requirements: Tensor must be exactly 2D
Error conditions: Non-2D tensor
let matrix = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
let transposed = matrix.transpose()?; // [[1, 2], [3, 4]] -> [[1, 3], [2, 4]]
Squeeze
fn squeeze(&self, dim: Option<usize>) -> Result<Tensor>
Removes dimensions of size 1.
Parameters:
- dim: None - Remove all dimensions of size 1
- dim: Some(i) - Remove dimension i only if it has size 1
Error conditions:
- Invalid dimension index
- Trying to squeeze dimension with size > 1
let tensor = Tensor::ones(vec![1, 3, 1, 2])?; // Shape: [1, 3, 1, 2]
let squeezed = tensor.squeeze(None)?;         // Shape: [3, 2]
let partial = tensor.squeeze(Some(0))?;       // Shape: [3, 1, 2]
Unsqueeze
fn unsqueeze(&self, dim: usize) -> Result<Tensor>
Adds a dimension of size 1 at the specified position.
Parameters:
- dim - Position at which to insert the new dimension (0 to ndim, inclusive)
Error conditions: Invalid dimension index (> ndim)
let tensor = Tensor::ones(vec![3, 2])?; // Shape: [3, 2]
let unsqueezed = tensor.unsqueeze(0)?;  // Shape: [1, 3, 2]
let middle = tensor.unsqueeze(1)?;      // Shape: [3, 1, 2]
let end = tensor.unsqueeze(2)?;         // Shape: [3, 2, 1]
Broadcasting Rules
Tensor Frame follows NumPy/PyTorch broadcasting conventions:
Alignment
Shapes are aligned from the rightmost dimension:
Tensor A: [3, 1, 4]
Tensor B: [2, 4]
Result: [3, 2, 4]
Size 1 Expansion
Dimensions of size 1 are expanded to match:
Tensor A: [3, 1, 4]
Tensor B: [3, 2, 1]
Result: [3, 2, 4]
Missing Dimensions
Missing leading dimensions are treated as size 1:
Tensor A: [5, 3, 2]
Tensor B: [3, 2]
Result: [5, 3, 2]
Incompatible Shapes
These shapes cannot be broadcast:
Tensor A: [3, 4]
Tensor B: [2, 4] # Error: 3 and 2 cannot be broadcast
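The same rule in code, using addition (currently the only broadcasting operation). A sketch; the printed error text depends on the library:

let a = Tensor::ones(vec![3, 4])?;
let b = Tensor::ones(vec![2, 4])?;

// 3 and 2 cannot be broadcast, so the addition returns an error
match a + b {
    Ok(c) => println!("unexpected success, shape {:?}", c.shape().dims()),
    Err(e) => eprintln!("broadcast failed as expected: {}", e),
}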
Performance Notes
Operation Fusion
- Operations on the same backend avoid intermediate allocations when possible
- Sequential reductions can be fused into single kernel calls
Memory Layout
- All tensors use row-major (C-style) memory layout
- Reshape operations are zero-copy when layout permits
- Transpose creates new memory layout
Backend-Specific Optimizations
- CPU: Uses Rayon for parallel element-wise operations
- WGPU: Utilizes compute shaders for parallel GPU execution
- CUDA: Uses custom kernels for all operations
Broadcasting Performance
- Zero-copy broadcasting when one tensor has size-1 dimensions
- Explicit memory expansion fallback for complex broadcasting patterns
- GPU backends optimize broadcasting in compute shaders
Backends Overview
Tensor Frame's backend system provides a pluggable architecture for running tensor operations on different computational devices. This allows the same high-level tensor API to transparently utilize CPU cores, integrated GPUs, discrete GPUs, and specialized accelerators.
Available Backends
| Backend | Feature Flag | Availability | Best Use Cases |
|---------|--------------|--------------|----------------|
| CPU | cpu (default) | Always | Small tensors, development, fallback |
| WGPU | wgpu | Cross-platform GPU | Large tensors, cross-platform deployment |
| CUDA | cuda | NVIDIA GPUs | High-performance production workloads |
Backend Selection Strategy
Automatic Selection (Recommended)
By default, Tensor Frame automatically selects the best available backend using this priority order:
- CUDA - Highest performance on NVIDIA hardware
- WGPU - Cross-platform GPU acceleration
- CPU - Universal fallback
#![allow(unused)] fn main() { use tensor_frame::Tensor; // Automatically uses best available backend let tensor = Tensor::zeros(vec![1000, 1000])?; println!("Using backend: {:?}", tensor.backend_type()); }
Manual Backend Control
For specific requirements, you can control backend selection:
#![allow(unused)] fn main() { use tensor_frame::backend::{set_backend_priority, BackendType}; // Force CPU-only execution let backend = set_backend_priority(vec![BackendType::Cpu]); // Prefer WGPU over CUDA let backend = set_backend_priority(vec![ BackendType::Wgpu, BackendType::Cuda, BackendType::Cpu ]); }
Per-Tensor Backend Conversion
Convert individual tensors between backends:
#![allow(unused)] fn main() { let cpu_tensor = Tensor::ones(vec![100, 100])?; // Move to GPU let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?; // Move back to CPU let back_to_cpu = gpu_tensor.to_backend(BackendType::Cpu)?; }
Performance Characteristics
CPU Backend
- Latency: Lowest for small operations (< 1ms)
- Throughput: Limited by CPU cores and memory bandwidth
- Memory: System RAM (typically abundant)
- Parallelism: Thread-level via Rayon
- Overhead: Minimal function call overhead
WGPU Backend
- Latency: Higher initialization cost (~1-10ms)
- Throughput: High for large, parallel operations
- Memory: GPU memory (limited but fast)
- Parallelism: Massive thread-level via compute shaders
- Overhead: GPU command submission and synchronization
CUDA Backend
- Latency: Moderate initialization cost (~0.1-1ms)
- Throughput: Highest for supported operations
- Memory: GPU memory with CUDA optimizations
- Parallelism: Optimal GPU utilization via cuBLAS/cuDNN
- Overhead: Minimal with mature driver stack
When to Use Each Backend
CPU Backend
#![allow(unused)] fn main() { // Good for: let small_tensor = Tensor::ones(vec![10, 10])?; // Small tensors let dev_tensor = Tensor::zeros(vec![100])?; // Development/testing let scalar_ops = tensor.sum(None)?; // Scalar results // Avoid for: // - Large matrix multiplications (> 1000x1000) // - Batch operations on many tensors // - Compute-intensive element-wise operations }
WGPU Backend
#![allow(unused)] fn main() { // Good for: let large_tensor = Tensor::zeros(vec![2048, 2048])?; // Large tensors let batch_ops = tensors.iter().map(|t| t * 2.0); // Batch operations let element_wise = (a * b) + c; // Complex element-wise // Consider for: // - Cross-platform deployment // - When CUDA is not available // - Mixed CPU/GPU workloads }
CUDA Backend
#![allow(unused)] fn main() { // Excellent for: let huge_tensor = Tensor::zeros(vec![4096, 4096])?; // Very large tensors let matrix_mul = a.matmul(&b)?; // Matrix operations let ml_workload = model.forward(input)?; // ML training/inference // Best when: // - NVIDIA GPU available // - Performance is critical // - Using alongside other CUDA libraries }
Cross-Backend Operations
Operations between tensors on different backends automatically handle conversion:
#![allow(unused)] fn main() { let cpu_a = Tensor::ones(vec![1000])?; let gpu_b = Tensor::zeros(vec![1000])?.to_backend(BackendType::Wgpu)?; // Automatically converts to common backend let result = cpu_a + gpu_b; // Runs on CPU backend }
Conversion Rules:
- If backends match, operation runs on that backend
- If backends differ, converts to the "lower priority" backend
- Priority order: CPU > WGPU > CUDA (CPU is most compatible)
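A sketch of that rule in practice (assumes the wgpu feature is enabled; on a CPU-only build the to_backend call itself would return an error):

let a = Tensor::ones(vec![256])?;                                 // CPU
let b = Tensor::ones(vec![256])?.to_backend(BackendType::Wgpu)?;  // WGPU

// The operands are brought onto a common backend before the addition runs
let c = (a + b)?;
println!("result backend: {:?}", c.backend_type());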
Memory Management
Reference Counting
All backends use reference counting for efficient memory management:
#![allow(unused)] fn main() { let tensor1 = Tensor::ones(vec![1000, 1000])?; let tensor2 = tensor1.clone(); // O(1) - just increments reference count // Memory freed automatically when last reference dropped }
Cross-Backend Memory
Converting between backends allocates new memory:
#![allow(unused)] fn main() { let cpu_tensor = Tensor::ones(vec![1000])?; // 4KB CPU memory let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?; // +4KB GPU memory // Both tensors exist independently until dropped }
Memory Usage Guidelines
- Development: Use CPU backend to avoid GPU memory pressure
- Production: Convert to GPU early, minimize cross-backend copies
- Mixed workloads: Keep frequently-accessed tensors on CPU
- Large datasets: Stream data through GPU backends
Error Handling
Backend operations can fail for various reasons:
#![allow(unused)] fn main() { match Tensor::zeros(vec![100000, 100000]) { Ok(tensor) => println!("Created tensor on {:?}", tensor.backend_type()), Err(TensorError::BackendError(msg)) => { eprintln!("Backend error: {}", msg); // Fallback to smaller size or different backend } Err(e) => eprintln!("Other error: {}", e), } }
Common Error Scenarios:
- GPU Out of Memory: Try smaller tensors or CPU backend
- Backend Unavailable: Fallback to CPU backend
- Feature Not Implemented: Some operations only available on certain backends
- Cross-Backend Type Mismatch: Ensure compatible data types
Backend Implementation Status
| Operation | CPU | WGPU | CUDA |
|-----------|-----|------|------|
| Basic arithmetic (+, -, *, /) | ✅ | ✅ | ✅ |
| Reductions (sum, mean) | ✅ | ❌ | ✅ |
| Reshape, transpose | ✅ | ✅ | ✅ |
| Broadcasting | ✅ | ✅ | ✅ |
✅ = Fully implemented
❌ = Not yet implemented
⚠️ = Partially implemented
CPU Backend
The CPU backend is the default and most mature backend in Tensor Frame. It provides reliable tensor operations using system memory and CPU cores, with parallelization via the Rayon library.
Features
- Always Available: No additional dependencies required
- Parallel Processing: Multi-threaded operations via Rayon
- Full API Support: All tensor operations implemented
- Memory Efficient: Direct Vec storage without additional overhead
- Debugging Friendly: Easy inspection with standard debugging tools
Configuration
The CPU backend is enabled by default:
[dependencies]
tensor_frame = "0.0.1-alpha" # CPU backend included
Or explicitly:
[dependencies]
tensor_frame = { version = "0.0.1-alpha", features = ["cpu"] }
Implementation Details
Storage
CPU tensors use standard Rust Vec<f32> for data storage:
pub enum Storage {
    Cpu(Vec<f32>), // Direct vector storage
    // ...
}
This provides:
- Memory Layout: Contiguous, row-major (C-style) layout
- Access: Direct memory access without marshaling overhead
- Debugging: Easy inspection with standard Rust tools
Parallelization
The CPU backend uses Rayon for data-parallel operations:
#![allow(unused)] fn main() { // Element-wise operations are parallelized a.par_iter() .zip(b.par_iter()) .map(|(a, b)| a + b) .collect() }
Thread Pool: Rayon automatically manages a global thread pool sized to the number of CPU cores.
Granularity: Operations are automatically chunked for optimal parallel efficiency.
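As a self-contained illustration of this pattern (not the library's internal code, just a sketch of a Rayon-parallel element-wise add over two slices):

use rayon::prelude::*;

fn parallel_add(a: &[f32], b: &[f32]) -> Vec<f32> {
    // Rayon splits the zipped iterator into chunks and runs them on its
    // global thread pool, one chunk per worker task.
    a.par_iter()
        .zip(b.par_iter())
        .map(|(x, y)| x + y)
        .collect()
}

fn main() {
    let a = vec![1.0_f32; 1_000_000];
    let b = vec![2.0_f32; 1_000_000];
    let c = parallel_add(&a, &b);
    println!("c[0] = {}, len = {}", c[0], c.len());
}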
Performance Characteristics
Strengths
- Low Latency: Minimal overhead for small operations
- Predictable: Performance scales linearly with data size and core count
- Memory Bandwidth: Efficiently utilizes system memory bandwidth
- Cache Friendly: Good locality for sequential operations
Limitations
- Compute Bound: Limited by CPU ALU throughput
- Memory Bound: Large operations limited by RAM bandwidth
- Thread Overhead: Parallel overhead dominates for small tensors
Performance Guidelines
Optimal Use Cases
#![allow(unused)] fn main() { // Small to medium tensors (< 10K elements) let small = Tensor::ones(vec![100, 100])?; // Scalar reductions let sum = large_tensor.sum(None)?; // Development and prototyping let test_tensor = Tensor::from_vec(test_data, shape)?; }
Suboptimal Use Cases
#![allow(unused)] fn main() { // Very large tensor operations let huge_op = a + b; // Consider GPU for very large tensors // Repeated large element-wise operations for _ in 0..1000 { result = (a.clone() * b.clone())?; // GPU would be faster } }
Memory Management
Allocation
CPU tensors allocate memory directly from the system heap:
#![allow(unused)] fn main() { let tensor = Tensor::zeros(vec![1000, 1000])?; // Allocates 4MB }
Reference Counting
Tensors use Arc<Vec<f32>> internally for efficient cloning:
#![allow(unused)] fn main() { let tensor1 = Tensor::ones(vec![1000])?; let tensor2 = tensor1.clone(); // O(1) reference count increment // Memory shared until one tensor is modified (copy-on-write semantics) }
Memory Usage
Monitor memory usage with standard system tools:
# Linux
cat /proc/meminfo
# macOS
vm_stat
# Windows
wmic OS get TotalVisibleMemorySize,FreePhysicalMemory
Debugging and Profiling
Tensor Inspection
CPU tensors are easy to inspect:
#![allow(unused)] fn main() { let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?; // Direct access to underlying data let data = tensor.to_vec()?; println!("Raw data: {:?}", data); // Shape information println!("Shape: {:?}", tensor.shape().dims()); println!("Elements: {}", tensor.numel()); }
Performance Profiling
Use standard Rust profiling tools:
#![allow(unused)] fn main() { // Add timing use std::time::Instant; let start = Instant::now(); let result = large_tensor.sum(None)?; println!("CPU operation took: {:?}", start.elapsed()); }
For detailed profiling:
# Install flamegraph
cargo install flamegraph
# Profile your application
cargo flamegraph --bin your_app
Thread Analysis
Monitor Rayon thread usage:
#![allow(unused)] fn main() { // Check thread pool size println!("Rayon threads: {}", rayon::current_num_threads()); // Custom thread pool let pool = rayon::ThreadPoolBuilder::new() .num_threads(4) .build()?; pool.install(|| { // Operations here use 4 threads max let result = tensor1 + tensor2; }); }
Error Handling
CPU backend errors are typically related to memory allocation:
#![allow(unused)] fn main() { use tensor_frame::{Tensor, TensorError}; match Tensor::zeros(vec![100000, 100000]) { Ok(tensor) => { // Success - 40GB allocated } Err(TensorError::BackendError(msg)) => { // Likely out of memory eprintln!("CPU backend error: {}", msg); } Err(e) => { eprintln!("Other error: {}", e); } } }
Common Error Conditions:
- Out of Memory: Requesting more memory than available
- Integer Overflow: Tensor dimensions too large for address space
- Thread Panic: Rayon worker thread panics (rare)
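Because an f32 tensor needs numel * 4 bytes, a back-of-the-envelope size check before allocating can avoid most out-of-memory surprises. A sketch (the helper name and threshold are illustrative, not part of the library):

fn estimated_bytes(shape: &[usize]) -> usize {
    // f32 elements are 4 bytes each: total = product of dimensions * 4
    shape.iter().product::<usize>() * std::mem::size_of::<f32>()
}

fn main() {
    let shape = [100_000, 100_000];
    let bytes = estimated_bytes(&shape);
    // ~40 GB -- compare against available RAM before calling Tensor::zeros
    println!("requested tensor would need about {} GB", bytes / 1_000_000_000);
}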
Optimization Tips
Memory Layout Optimization
#![allow(unused)] fn main() { // Prefer contiguous operations let result = (a + b) * c; // Better than separate operations // Avoid unnecessary allocations let result = a.clone() + b; // Creates temporary clone let result = &a + &b; // Better - uses references }
Parallel Operation Tuning
#![allow(unused)] fn main() { // For very small tensors, disable parallelism let small_result = small_a + small_b; // Rayon decides automatically // For custom control rayon::ThreadPoolBuilder::new() .num_threads(1) // Force single-threaded .build_global()?; }
Cache Optimization
#![allow(unused)] fn main() { // Process data in blocks for better cache usage for chunk in tensor.chunks(cache_friendly_size) { // Process chunk } // Transpose cache-friendly let transposed = matrix.transpose()?; // May benefit from blocking }
Integration with Other Libraries
NumPy Compatibility
#![allow(unused)] fn main() { // Convert to/from Vec for NumPy interop let tensor = Tensor::from_vec(numpy_data, shape)?; let back_to_numpy = tensor.to_vec()?; }
ndarray Integration
#![allow(unused)] fn main() { use ndarray::Array2; // Convert from ndarray let nd_array = Array2::from_shape_vec((2, 2), vec![1.0, 2.0, 3.0, 4.0])?; let tensor = Tensor::from_vec(nd_array.into_raw_vec(), vec![2, 2])?; // Convert to ndarray let data = tensor.to_vec()?; let shape = tensor.shape().dims(); let nd_array = Array2::from_shape_vec((shape[0], shape[1]), data)?; }
BLAS Integration
For maximum performance, consider linking with optimized BLAS:
[dependencies]
tensor_frame = "0.0.1-alpha"
blas-src = { version = "0.8", features = ["openblas"] }
This can significantly speed up matrix operations on the CPU backend.
WGPU Backend
The WGPU backend provides cross-platform GPU compute acceleration using the WebGPU standard. It supports Metal, Vulkan, DirectX 12, and OpenGL backends, making it an excellent choice for portable high-performance computing.
Features
- Cross-Platform: Works on Windows, macOS, Linux, iOS, Android, and Web
- Multiple APIs: Supports Vulkan, Metal, DX12, DX11, OpenGL ES, and WebGL
- Compute Shaders: Uses WGSL (WebGPU Shading Language) for parallel operations
- Memory Efficient: GPU buffer management with automatic cleanup
- Future-Proof: Based on the emerging WebGPU standard
Installation
Enable the WGPU backend with the feature flag:
[dependencies]
tensor_frame = { version = "0.0.1-alpha", features = ["wgpu"] }
Additional Dependencies:
- No platform-specific GPU drivers required
- Uses system graphics drivers (Metal, Vulkan, DirectX, OpenGL)
System Requirements
Minimum Requirements
- GPU: Any GPU with compute shader support
- Driver: Up-to-date graphics drivers
- Memory: Sufficient GPU memory for tensor data
Supported Platforms
| Platform | Graphics API | Status |
|----------|--------------|--------|
| Windows | DirectX 12, Vulkan | ✅ Full support |
| Windows | DirectX 11 | ✅ Fallback support |
| macOS | Metal | ✅ Full support |
| Linux | Vulkan | ✅ Full support |
| Linux | OpenGL ES | ⚠️ Limited support |
| iOS | Metal | ✅ Full support |
| Android | Vulkan, OpenGL ES | ✅ Full support |
| Web | WebGPU, WebGL2 | ⚠️ Experimental |
Implementation Details
Storage
WGPU tensors use GPU buffers for data storage:
pub struct WgpuStorage {
    pub buffer: Arc<wgpu::Buffer>, // GPU buffer handle
}
Buffer Properties:
- Location: GPU memory (VRAM)
- Layout: Contiguous, row-major layout
- Usage: Storage buffers with compute shader access
- Synchronization: Automatic via command queue
Compute Shaders
Operations are implemented as WGSL compute shaders:
// Element-wise addition shader
@group(0) @binding(0) var<storage, read> input_a: array<f32>;
@group(0) @binding(1) var<storage, read> input_b: array<f32>;
@group(0) @binding(2) var<storage, read_write> output: array<f32>;
@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) global_id: vec3<u32>) {
let index = global_id.x;
if (index >= arrayLength(&input_a)) {
return;
}
output[index] = input_a[index] + input_b[index];
}
Parallelization
- Workgroups: Operations dispatched in parallel workgroups
- Thread Count: Automatically sized based on tensor dimensions
- GPU Utilization: Optimized for high occupancy on modern GPUs
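For the 64-thread workgroup size used in the shader above, the number of workgroups to dispatch is a ceiling division over the element count. A sketch of that arithmetic only (the actual dispatch call is internal to the backend):

const WORKGROUP_SIZE: u32 = 64;

fn workgroup_count(num_elements: u32) -> u32 {
    // Ceiling division: enough 64-thread groups to cover every element;
    // threads past the end exit early via the bounds check in the shader.
    (num_elements + WORKGROUP_SIZE - 1) / WORKGROUP_SIZE
}

fn main() {
    // A 2048 x 2048 tensor has 4_194_304 elements -> 65_536 workgroups
    println!("{}", workgroup_count(2048 * 2048));
}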
Performance Characteristics
Strengths
- Massive Parallelism: Thousands of parallel threads
- High Throughput: Excellent for large tensor operations
- Memory Bandwidth: High GPU memory bandwidth utilization
- Compute Density: Specialized compute units for arithmetic operations
Limitations
- Latency: GPU command submission and synchronization overhead
- Memory Transfer: CPU-GPU data transfers can be expensive
- Limited Precision: Currently only supports f32 operations
- Shader Compilation: First-use compilation overhead
Performance Guidelines
Optimal Use Cases
#![allow(unused)] fn main() { // Large tensor operations (> 10K elements) let large = Tensor::zeros(vec![2048, 2048])?; let result = (large_a * large_b) + large_c; // Repeated operations on same-sized tensors for batch in batches { let output = model.forward(batch)?; // Shader programs cached } // Element-wise operations with complex expressions let result = ((a * b) + c).sqrt(); // Single GPU kernel }
Suboptimal Use Cases
#![allow(unused)] fn main() { // Very small tensors let small = Tensor::ones(vec![10, 10])?; // GPU overhead dominates // Frequent CPU-GPU transfers let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?; let back_to_cpu = gpu_tensor.to_vec()?; // Expensive transfers // Scalar operations let sum = tensor.sum(None)?; // Result copied back to CPU }
Memory Management
GPU Memory Allocation
WGPU automatically manages GPU memory:
#![allow(unused)] fn main() { let tensor = Tensor::zeros(vec![2048, 2048])?; // Allocates ~16MB GPU memory }
Memory Pool: WGPU uses internal memory pools for efficient allocation
Garbage Collection: Buffers automatically freed when last reference dropped
Fragmentation: Large allocations may fail even with sufficient total memory
Memory Transfer Patterns
#![allow(unused)] fn main() { // Efficient: Create on GPU let gpu_tensor = Tensor::zeros(vec![1000, 1000])? .to_backend(BackendType::Wgpu)?; // Inefficient: Frequent transfers let result = cpu_data.to_backend(BackendType::Wgpu)? .sum(None)? .to_backend(BackendType::Cpu)?; // Multiple transfers }
Memory Debugging
Monitor GPU memory usage:
#![allow(unused)] fn main() { // Check GPU memory limits let limits = device.limits(); println!("Max buffer size: {} MB", limits.max_buffer_size / (1024*1024)); // Handle out-of-memory errors match Tensor::zeros(vec![16384, 16384]) { Ok(tensor) => println!("Allocated 1GB GPU tensor"), Err(TensorError::BackendError(msg)) if msg.contains("memory") => { eprintln!("GPU out of memory, trying smaller size"); } Err(e) => eprintln!("Other error: {}", e), } }
Debugging and Profiling
Shader Debugging
WGPU provides validation and debugging features:
#![allow(unused)] fn main() { // Enable validation (debug builds) let instance = wgpu::Instance::new(&wgpu::InstanceDescriptor { backends: wgpu::Backends::all(), flags: wgpu::InstanceFlags::DEBUG | wgpu::InstanceFlags::VALIDATION, ..Default::default() }); }
Performance Profiling
Use GPU profiling tools:
Windows (DirectX):
- PIX for Windows
- RenderDoc
- Visual Studio Graphics Diagnostics
macOS (Metal):
- Xcode Instruments (GPU Timeline)
- Metal System Trace
Linux (Vulkan):
- RenderDoc
- Vulkan Tools
Custom Timing
#![allow(unused)] fn main() { use std::time::Instant; let start = Instant::now(); let result = gpu_tensor_a + gpu_tensor_b; // Note: GPU operations are asynchronous! let _data = result.to_vec()?; // Synchronization point println!("GPU operation took: {:?}", start.elapsed()); }
Error Handling
WGPU backend errors can occur at multiple levels:
Device Creation Errors
#![allow(unused)] fn main() { match WgpuBackend::new() { Ok(backend) => println!("WGPU backend ready"), Err(TensorError::BackendError(msg)) => { eprintln!("WGPU initialization failed: {}", msg); // Fallback to CPU backend } } }
Runtime Errors
#![allow(unused)] fn main() { // Out of GPU memory let result = Tensor::zeros(vec![100000, 100000]); // May fail // Shader compilation errors (rare) let result = custom_operation(tensor); // May fail for invalid shaders // Device lost (driver reset, etc.) let result = tensor.sum(None); // May fail if device is lost }
Common Error Scenarios:
- Device Not Found: No compatible GPU available
- Out of Memory: GPU memory exhausted
- Driver Issues: Outdated or buggy graphics drivers
- Unsupported Operations: Feature not implemented in WGPU backend
Platform-Specific Notes
Windows
- DirectX 12: Best performance and feature support
- Vulkan: Good alternative if DX12 not available
- DirectX 11: Fallback with limited compute support
macOS
- Metal: Excellent native support and performance
- MoltenVK: Vulkan compatibility layer (not recommended for production)
Linux
- Vulkan: Primary choice with best performance
- OpenGL: Fallback with limited compute features
- Graphics Drivers: Ensure latest Mesa/NVIDIA/AMD drivers
Mobile (iOS/Android)
- iOS: Metal provides excellent mobile GPU performance
- Android: Vulkan on newer devices, OpenGL ES fallback
- Power Management: Be aware of thermal throttling
Web (Experimental)
- WebGPU: Emerging standard with excellent performance potential
- WebGL2: Fallback with compute shader emulation
- Browser Support: Chrome/Edge (flag), Firefox (experimental)
Optimization Tips
Workgroup Size Tuning
#![allow(unused)] fn main() { // Optimal workgroup sizes depend on GPU architecture // Current default: 64 threads per workgroup // Nvidia: 32 (warp size) or 64 // AMD: 64 (wavefront size) // Intel: 32 or 64 // Mobile: 16 or 32 }
Batch Operations
#![allow(unused)] fn main() { // Efficient: Batch similar operations let results: Vec<Tensor> = inputs .iter() .map(|input| model.forward(input)) .collect()?; // Inefficient: Individual operations for input in inputs { let result = model.forward(input)?; let cpu_result = result.to_vec()?; // Forces synchronization } }
Memory Layout Optimization
#![allow(unused)] fn main() { // Ensure tensor shapes are GPU-friendly let aligned_size = (size + 63) & !63; // Align to 64-element boundaries let tensor = Tensor::zeros(vec![aligned_size, aligned_size])?; }
Future Developments
The WGPU backend is actively developed with planned improvements:
- Reduction Operations: Sum, mean, and other reductions on GPU
- Advanced Operations: GPU-optimized tensor operations
- Mixed Precision: f16 and bf16 data type support
- Async Operations: Fully asynchronous GPU command queues
- WebGPU Stability: Production-ready web deployment
CUDA Backend
The CUDA backend provides high-performance tensor operations on NVIDIA GPUs using the CUDA toolkit. It offers the highest performance for supported operations and integrates well with the broader CUDA ecosystem.
Features
- Peak Performance: Optimized kernels for maximum NVIDIA GPU utilization
- Optimized Kernels: Hardware-accelerated tensor operations
- Memory Optimization: Efficient GPU memory management
- Mature Ecosystem: Integration with existing CUDA libraries
- Production Ready: Battle-tested in production environments
Installation
Prerequisites
CUDA Toolkit: Install NVIDIA CUDA Toolkit 11.0 or later
- Download from NVIDIA Developer
- Ensure nvcc is in your PATH
- Verify installation: nvcc --version
Compatible GPU: NVIDIA GPU with compute capability 3.5+
- Check compatibility: nvidia-smi
- Verify compute capability: deviceQuery (from the CUDA samples)
Cargo Configuration
Enable the CUDA backend:
[dependencies]
tensor_frame = { version = "0.0.1-alpha", features = ["cuda"] }
Build Requirements:
- CUDA Toolkit installed
- NVIDIA GPU drivers
- C++ compiler (MSVC on Windows, GCC/Clang on Linux)
System Requirements
Hardware
- GPU: NVIDIA GPU with compute capability 3.5+
- Memory: Sufficient GPU memory for tensor operations
- PCIe: PCIe 3.0 x16 recommended for optimal memory bandwidth
Software
- CUDA Toolkit: Version 11.0+ (12.0+ recommended)
- Driver: NVIDIA driver supporting your CUDA version
- OS: Linux (preferred), Windows 10+, WSL2
Verified Configurations
| GPU Generation | Compute Capability | CUDA Version | Status |
|----------------|--------------------|--------------|--------|
| Maxwell (GTX 900) | 5.0, 5.2 | 11.0+ | ✅ Supported |
| Pascal (GTX 10x0) | 6.0, 6.1 | 11.0+ | ✅ Fully supported |
| Volta (V100) | 7.0 | 11.0+ | ✅ Optimized |
| Turing (RTX 20x0) | 7.5 | 11.0+ | ✅ Optimized |
| Ampere (RTX 30x0) | 8.0, 8.6 | 11.2+ | ✅ Optimal |
| Ada (RTX 40x0) | 8.9 | 12.0+ | ✅ Latest features |
Implementation Details
Storage
CUDA tensors use device memory pointers:
pub struct CudaStorage {
    pub ptr: *mut f32, // Raw CUDA device pointer
    pub len: usize,    // Buffer length in elements
}
Memory Properties:
- Location: GPU global memory (VRAM)
- Layout: Contiguous, row-major layout
- Alignment: 256-byte aligned for optimal coalescing
- Synchronization: Explicit via CUDA streams
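A small worked example of that 256-byte alignment (a sketch; the backend's allocator handles this automatically):

const ALIGN: usize = 256; // bytes

// Round a byte length up to the next 256-byte boundary
fn align_up(bytes: usize) -> usize {
    (bytes + ALIGN - 1) & !(ALIGN - 1)
}

fn main() {
    // A 10 x 10 f32 tensor needs 400 bytes, which rounds up to 512
    println!("{}", align_up(10 * 10 * 4));
}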
Kernel Implementation
Operations use optimized CUDA kernels:
// Element-wise addition kernel
__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
c[idx] = a[idx] + b[idx];
}
}
Performance Characteristics
Strengths
- Compute Throughput: Maximum FP32/FP16 throughput on NVIDIA hardware
- Memory Bandwidth: Optimal utilization of GPU memory bandwidth
- Kernel Optimization: Hand-tuned kernels for each operation
- Library Integration: Designed for future integration with cuDNN, etc.
Performance Metrics
Example performance on RTX 4090:
| Operation | Tensor Size | CPU (32 cores) | CUDA | Speedup |
|-----------|-------------|----------------|------|---------|
| Element-wise Add | 1M elements | 2.1 ms | 0.18 ms | 12x |
| Matrix Multiply | 2048x2048 | 450 ms | 8.2 ms | 55x |
| Reduction Sum | 16M elements | 15 ms | 0.52 ms | 29x |
Optimization Guidelines
Optimal Use Cases
#![allow(unused)] fn main() { // Large tensor operations let a = Tensor::zeros(vec![4096, 4096])?; let b = Tensor::zeros(vec![4096, 4096])?; let c = (a * b) + 1.0; // Excellent GPU performance // Batch operations for batch in large_dataset { let result = model.forward(batch)?; // Amortizes GPU overhead } // Memory-bound operations let result = ((a * b) + c) / d; // GPU memory bandwidth utilized }
Suboptimal Use Cases
#![allow(unused)] fn main() { // Very small tensors let tiny = Tensor::ones(vec![8, 8])?; // Kernel launch overhead dominates // Frequent host-device transfers let gpu_result = cpu_tensor.to_backend(BackendType::Cuda)?; let back_to_cpu = gpu_result.to_vec()?; // PCIe bandwidth bottleneck // Scalar reductions with immediate use let sum = tensor.sum(None)?.to_vec()?; // Forces synchronization }
Memory Management
Device Memory Allocation
CUDA tensors allocate GPU memory directly:
#![allow(unused)] fn main() { // Allocates 64MB of GPU memory let large_tensor = Tensor::zeros(vec![4096, 4096])? .to_backend(BackendType::Cuda)?; }
Memory Pool Management
The backend uses a memory pool for efficient allocation:
#![allow(unused)] fn main() { // Pool reduces allocation overhead let tensors: Vec<Tensor> = (0..100) .map(|_| Tensor::zeros(vec![1024, 1024])) .collect::<Result<Vec<_>>>()?; }
Memory Transfer Optimization
#![allow(unused)] fn main() { // Efficient: Batch transfers let gpu_tensors = cpu_tensors .into_iter() .map(|t| t.to_backend(BackendType::Cuda)) .collect::<Result<Vec<_>>>()?; // Inefficient: Individual transfers for cpu_tensor in cpu_tensors { let gpu_tensor = cpu_tensor.to_backend(BackendType::Cuda)?; process(gpu_tensor)?; } }
Memory Debugging
Monitor GPU memory usage:
# Check GPU memory
nvidia-smi
# Continuous monitoring
watch -n 1 nvidia-smi
#![allow(unused)] fn main() { // Check available memory let (free, total) = cuda::memory_info()?; println!("GPU memory: {}/{} MB", free / 1024 / 1024, total / 1024 / 1024); // Handle out-of-memory match Tensor::zeros(vec![16384, 16384]).and_then(|t| t.to_backend(BackendType::Cuda)) { Ok(tensor) => println!("Allocated 1GB GPU tensor"), Err(TensorError::BackendError(msg)) if msg.contains("memory") => { eprintln!("GPU OOM, trying smaller allocation"); } Err(e) => eprintln!("CUDA error: {}", e), } }
Error Handling
CUDA operations can fail for various hardware and software reasons:
Runtime Errors
#![allow(unused)] fn main() { use tensor_frame::{Tensor, TensorError}; match tensor_operation() { Ok(result) => process(result), Err(TensorError::BackendError(msg)) => { if msg.contains("out of memory") { // GPU memory exhausted fallback_to_cpu()?; } else if msg.contains("invalid device") { // GPU not available or driver issue retry_with_cpu_backend()?; } else { // Other CUDA error eprintln!("CUDA error: {}", msg); } } } }
Common Error Scenarios
- GPU Out of Memory: Tensor too large for available GPU memory
- Invalid Device: GPU not found or not compatible
- Driver Mismatch: CUDA driver version incompatible
- Kernel Launch Failed: Invalid kernel parameters or GPU fault
- Memory Access Violation: Invalid GPU memory access
Error Recovery
#![allow(unused)] fn main() { // Graceful fallback strategy fn robust_tensor_operation(tensor: Tensor) -> Result<Tensor> { // Try CUDA first if let Ok(cuda_tensor) = tensor.to_backend(BackendType::Cuda) { match cuda_operation(cuda_tensor) { Ok(result) => return Ok(result), Err(TensorError::BackendError(_)) => { // CUDA failed, fall back to CPU eprintln!("CUDA operation failed, falling back to CPU"); } } } // CPU fallback cpu_operation(tensor.to_backend(BackendType::Cpu)?) } }
Debugging and Profiling
CUDA Debugging Tools
NVIDIA Nsight Systems: System-wide performance analysis
nsys profile --stats=true ./your_app
NVIDIA Nsight Compute: Kernel-level profiling
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./your_app
cuda-memcheck: Memory error detection
cuda-memcheck ./your_app
Performance Analysis
#![allow(unused)] fn main() { // GPU timing with CUDA events use std::time::Instant; let start = Instant::now(); let result = gpu_tensor_a.matmul(&gpu_tensor_b)?; // Note: matmul is asynchronous! let _sync = result.to_vec()?; // Force synchronization let elapsed = start.elapsed(); println!("Matrix multiplication took: {:?}", elapsed); }
Memory Leak Detection
#![allow(unused)] fn main() { // Monitor for memory leaks in long-running applications fn check_memory_usage() -> Result<()> { let (free_before, total) = cuda::memory_info()?; // Perform operations { let tensor = Tensor::zeros(vec![1000, 1000])?.to_backend(BackendType::Cuda)?; let result = expensive_operation(tensor)?; } // tensor should be freed here let (free_after, _) = cuda::memory_info()?; if free_after < free_before { eprintln!("Potential memory leak detected!"); eprintln!("Memory delta: {} MB", (free_before - free_after) / 1024 / 1024); } Ok(()) } }
Production Deployment
Docker Configuration
# Use NVIDIA CUDA base image
FROM nvidia/cuda:12.0-devel-ubuntu20.04
# Install Rust
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
# Copy and build your application
COPY . /app
WORKDIR /app
RUN cargo build --release --features cuda
# Runtime with CUDA
FROM nvidia/cuda:12.0-runtime-ubuntu20.04
COPY --from=0 /app/target/release/your_app /usr/local/bin/
CMD ["your_app"]
Kubernetes Deployment
apiVersion: v1
kind: Pod
spec:
containers:
- name: tensor-app
image: your-app:latest
resources:
limits:
nvidia.com/gpu: 1
env:
- name: CUDA_VISIBLE_DEVICES
value: "0"
Environment Variables
# Limit GPU memory growth
export CUDA_MEMORY_POOL_TYPE=pool
# Enable GPU timing
export CUDA_LAUNCH_BLOCKING=1
# Select specific GPU
export CUDA_VISIBLE_DEVICES=0
Optimization Best Practices
Memory Access Patterns
#![allow(unused)] fn main() { // Coalesced memory access (efficient) let result = tensor_a + tensor_b; // Sequential element access // Strided access (less efficient) let transposed = tensor.transpose()?; // May require memory reshape }
Kernel Fusion
#![allow(unused)] fn main() { // Fused operations (single kernel launch) let result = ((a * b) + c).relu(); // Ideally fused into one kernel // Separate operations (multiple kernel launches) let temp1 = a * b; let temp2 = temp1 + c; let result = temp2.relu(); // Three separate kernels }
Stream Management
#![allow(unused)] fn main() { // Future: Async operations with CUDA streams // Currently synchronous, but optimizations planned let stream_a = cuda::create_stream()?; let stream_b = cuda::create_stream()?; // Parallel execution on different streams let result_a = tensor_a.sum(None).execute_on(stream_a)?; let result_b = tensor_b.mean(None).execute_on(stream_b)?; }
Integration with CUDA Ecosystem
cuDNN (Future)
Planned integration for neural network operations:
#![allow(unused)] fn main() { // Future: Convolution operations let output = input.conv2d(&kernel, stride, padding)?; }
NCCL (Future)
Multi-GPU communication for distributed computing:
#![allow(unused)] fn main() { // Future: Multi-GPU operations let distributed_result = tensor.all_reduce_sum()?; }
Examples and Tutorials
This section provides practical examples and tutorials for using Tensor Frame effectively. Each example is designed to demonstrate specific features and common usage patterns.
Getting Started Examples
Perfect for newcomers to Tensor Frame:
- Basic Operations - Tensor creation, arithmetic, and basic manipulation
- Broadcasting - Understanding automatic shape broadcasting
- Custom Backends - Working with different computational backends
Example Categories
Fundamental Operations
Learn the core tensor operations that form the foundation of all computational work:
#![allow(unused)] fn main() { // Tensor creation let zeros = Tensor::zeros(vec![3, 4])?; let ones = Tensor::ones(vec![2, 2])?; let data = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?; // Basic arithmetic let sum = a + b; let product = a * b; let result = (a * 2.0) + b; }
Shape Manipulation
Master tensor reshaping and dimension manipulation:
#![allow(unused)] fn main() { // Reshaping and transposition let reshaped = tensor.reshape(vec![4, 3])?; let transposed = matrix.transpose()?; // Dimension manipulation let squeezed = tensor.squeeze(None)?; let unsqueezed = squeezed.unsqueeze(1)?; }
Backend Optimization
Learn when and how to use different computational backends:
#![allow(unused)] fn main() { // Automatic backend selection let tensor = Tensor::zeros(vec![1000, 1000])?; // Manual backend control let gpu_tensor = tensor.to_backend(BackendType::Wgpu)?; let cuda_tensor = tensor.to_backend(BackendType::Cuda)?; }
Running Examples
All examples are located in the examples/ directory of the repository:
# Run basic operations example
cargo run --example basic_operations
# Run with specific backend
cargo run --example basic_operations --features wgpu
cargo run --example basic_operations --features cuda
# Run with all features
cargo run --example basic_operations --features "wgpu,cuda"
Example Structure
Each example follows a consistent structure:
- Setup: Import necessary modules and create test data
- Demonstration: Show the specific feature in action
- Explanation: Detailed comments explaining what's happening
- Performance Notes: Tips for optimal usage
- Error Handling: Proper error handling patterns
Performance Benchmarking
Many examples include performance comparisons:
#![allow(unused)] fn main() { use std::time::Instant; // CPU benchmark let start = Instant::now(); let cpu_result = &cpu_tensor + &cpu_other; let cpu_time = start.elapsed(); // GPU benchmark let start = Instant::now(); let gpu_result = &gpu_tensor + &gpu_other; let _sync = gpu_result.to_vec()?; // Force synchronization let gpu_time = start.elapsed(); println!("CPU: {:?}, GPU: {:?}, Speedup: {:.1}x", cpu_time, gpu_time, cpu_time.as_secs_f64() / gpu_time.as_secs_f64()); }
Interactive Examples
Some examples are designed for interactive exploration:
# Interactive tensor exploration
cargo run --example interactive
# Performance testing with different sizes
cargo run --example benchmark -- --size 1000
cargo run --example benchmark -- --size 2000 --backend cuda
Common Patterns
Error Handling Pattern
#![allow(unused)] fn main() { use tensor_frame::{Tensor, Result, TensorError}; fn robust_operation() -> Result<Tensor> { let tensor = Tensor::zeros(vec![1000, 1000])?; // Try GPU backend first match tensor.to_backend(BackendType::Wgpu) { Ok(gpu_tensor) => { // GPU operations here Ok(expensive_gpu_operation(gpu_tensor)?) } Err(TensorError::BackendError(_)) => { // Fallback to CPU println!("GPU not available, using CPU"); Ok(cpu_operation(tensor)?) } Err(e) => Err(e), } } }
Memory Management Pattern
#![allow(unused)] fn main() { fn memory_efficient_batch_processing(batches: Vec<Vec<f32>>) -> Result<Vec<Tensor>> { let backend = BackendType::Wgpu; // Choose once batches .into_iter() .map(|batch| { let len = batch.len(); // capture the length before the Vec is moved let tensor = Tensor::from_vec(batch, vec![len])?; tensor.to_backend(backend) // Convert once per batch }) .collect() } }
Broadcasting Pattern
#![allow(unused)] fn main() { fn demonstrate_broadcasting() -> Result<()> { // Scalar broadcast let tensor = Tensor::ones(vec![3, 4])?; let scaled = tensor * 2.0; // Scalar broadcasts to all elements // Vector broadcast let matrix = Tensor::ones(vec![3, 4])?; let vector = Tensor::ones(vec![4])?; // Shape: [4] let result = matrix + vector; // Broadcasts to [3, 4] // Matrix broadcast let a = Tensor::ones(vec![3, 1])?; // Shape: [3, 1] let b = Tensor::ones(vec![1, 4])?; // Shape: [1, 4] let result = a + b; // Result: [3, 4] Ok(()) } }
Advanced Examples
For users comfortable with the basics:
Custom Backend Selection
#![allow(unused)] fn main() { fn adaptive_backend_selection(tensor_size: usize) -> BackendType { match tensor_size { 0..=1000 => BackendType::Cpu, // Small: CPU overhead minimal 1001..=100000 => BackendType::Wgpu, // Medium: GPU beneficial _ => BackendType::Cuda, // Large: Maximum performance } } }
Batched Operations
#![allow(unused)] fn main() { fn process_batch_efficiently(inputs: Vec<Tensor>) -> Result<Vec<Tensor>> { // Convert all inputs to same backend let backend = BackendType::Wgpu; let gpu_inputs: Result<Vec<_>> = inputs .into_iter() .map(|t| t.to_backend(backend)) .collect(); // Process on GPU let gpu_outputs: Result<Vec<_>> = gpu_inputs? .into_iter() .map(|input| expensive_operation(input)) .collect(); gpu_outputs } }
Troubleshooting Common Issues
Performance Problems
#![allow(unused)] fn main() { // Problem: Slow operations on small tensors let small = Tensor::ones(vec![10, 10])?; let slow_result = small.to_backend(BackendType::Wgpu)?; // GPU overhead // Solution: Use CPU for small tensors let fast_result = small; // Stay on CPU backend }
Memory Issues
#![allow(unused)] fn main() { // Problem: GPU out of memory match Tensor::zeros(vec![10000, 10000]) { Err(TensorError::BackendError(msg)) if msg.contains("memory") => { // Solution: Use smaller chunks or CPU backend let chunks = create_smaller_chunks()?; process_chunks_individually(chunks)?; } Ok(tensor) => process_large_tensor(tensor)?, Err(e) => return Err(e), } }
Backend Compatibility
#![allow(unused)] fn main() { // Problem: Operation not supported on backend let result = match tensor.backend_type() { BackendType::Wgpu => { // Some operations not yet implemented on WGPU tensor.to_backend(BackendType::Cpu)?.complex_operation()? } _ => tensor.complex_operation()?, }; }
Contributing Examples
We welcome contributions of new examples! Please follow these guidelines:
- Clear Purpose: Each example should demonstrate a specific concept
- Complete Code: Include all necessary imports and error handling
- Documentation: Add detailed comments explaining the concepts
- Performance Notes: Include timing and backend recommendations
- Error Handling: Show proper error handling patterns
See the Contributing Guide for more details on submitting examples.
Basic Operations
This example demonstrates the fundamental tensor operations in Tensor Frame. It covers tensor creation, basic arithmetic, shape manipulation, and data access patterns.
Complete Example
use tensor_frame::{Tensor, Result, TensorOps}; use std::time::Instant; fn main() -> Result<()> { println!("=== Tensor Frame Basic Operations ===\n"); // 1. Tensor Creation tensor_creation_examples()?; // 2. Basic Arithmetic arithmetic_examples()?; // 3. Shape Manipulation shape_manipulation_examples()?; // 4. Data Access data_access_examples()?; // 5. Performance Comparison performance_comparison()?; Ok(()) } /// Demonstrates various ways to create tensors fn tensor_creation_examples() -> Result<()> { println!("=== Tensor Creation ==="); // Create tensor filled with zeros let zeros = Tensor::zeros(vec![2, 3])?; println!("Zeros tensor (2x3):\n{}\n", zeros); // Create tensor filled with ones let ones = Tensor::ones(vec![3, 2])?; println!("Ones tensor (3x2):\n{}\n", ones); // Create tensor from existing data let data = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0]; let from_data = Tensor::from_vec(data, vec![2, 3])?; println!("From data (2x3):\n{}\n", from_data); // Check tensor properties println!("Tensor properties:"); println!(" Shape: {:?}", from_data.shape().dims()); println!(" Number of elements: {}", from_data.numel()); println!(" Data type: {:?}", from_data.dtype()); println!(" Backend: {:?}\n", from_data.backend_type()); Ok(()) } /// Demonstrates basic arithmetic operations fn arithmetic_examples() -> Result<()> { println!("=== Arithmetic Operations ==="); // Create test tensors let a = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?; let b = Tensor::from_vec(vec![5.0, 6.0, 7.0, 8.0], vec![2, 2])?; println!("Tensor A:\n{}\n", a); println!("Tensor B:\n{}\n", b); // Element-wise addition let sum = &a + &b; // Use references to avoid moving tensors println!("A + B:\n{}\n", sum); // Element-wise subtraction let diff = &a - &b; println!("A - B:\n{}\n", diff); // Element-wise multiplication let product = &a * &b; println!("A * B (element-wise):\n{}\n", product); // Element-wise division let quotient = &a / &b; println!("A / B:\n{}\n", quotient); // Chained operations let complex = ((&a * 2.0) + &b) / 3.0; println!("(A * 2 + B) / 3:\n{}\n", complex); Ok(()) } /// Demonstrates shape manipulation operations fn shape_manipulation_examples() -> Result<()> { println!("=== Shape Manipulation ==="); // Create a tensor to manipulate let tensor = Tensor::from_vec( vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], vec![2, 4] )?; println!("Original tensor (2x4):\n{}\n", tensor); // Reshape to different dimensions let reshaped = tensor.reshape(vec![4, 2])?; println!("Reshaped to (4x2):\n{}\n", reshaped); // Reshape to 1D let flattened = tensor.reshape(vec![8])?; println!("Flattened to (8,):\n{}\n", flattened); // Transpose (2D only) let matrix = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?; let transposed = matrix.transpose()?; println!("Original matrix:\n{}\n", matrix); println!("Transposed matrix:\n{}\n", transposed); // Squeeze and unsqueeze let with_ones = Tensor::ones(vec![1, 3, 1])?; println!("Tensor with size-1 dimensions (1x3x1):\n{}\n", with_ones); let squeezed = with_ones.squeeze(None)?; println!("Squeezed (removes all size-1 dims):\n{}\n", squeezed); let unsqueezed = squeezed.unsqueeze(0)?; println!("Unsqueezed at dimension 0:\n{}\n", unsqueezed); Ok(()) } /// Demonstrates data access patterns fn data_access_examples() -> Result<()> { println!("=== Data Access ==="); let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?; println!("Tensor:\n{}\n", tensor); // Convert to Vec for external use let data = tensor.to_vec()?; println!("As Vec<f32>: {:?}\n", data); // Reduction 
operations let sum_all = tensor.sum(None)?; println!("Sum of all elements: {}\n", sum_all); let mean_all = tensor.mean(None)?; println!("Mean of all elements: {}\n", mean_all); // Axis-specific reductions let row_sums = tensor.sum(Some(1))?; // Sum along columns (axis 1) println!("Row sums (sum along axis 1): {}\n", row_sums); let col_sums = tensor.sum(Some(0))?; // Sum along rows (axis 0) println!("Column sums (sum along axis 0): {}\n", col_sums); Ok(()) } /// Demonstrates performance characteristics fn performance_comparison() -> Result<()> { println!("=== Performance Comparison ==="); // Small tensor operations (CPU should be faster) let small_a = Tensor::ones(vec![100, 100])?; let small_b = Tensor::ones(vec![100, 100])?; let start = Instant::now(); let result = &small_a + &small_b; let small_time = start.elapsed(); println!("Small tensor (100x100) addition: {:?}", small_time); // Large tensor operations (GPU might be faster if available) let large_a = Tensor::ones(vec![1000, 1000])?; let large_b = Tensor::ones(vec![1000, 1000])?; let start = Instant::now(); let result = &large_a + &large_b; let large_time = start.elapsed(); println!("Large tensor (1000x1000) addition: {:?}", large_time); // Show current backend println!("Current backend: {:?}", result.backend_type()); // Demonstrate backend conversion (if other backends available) #[cfg(feature = "wgpu")] { println!("\n--- WGPU Backend Comparison ---"); let start = Instant::now(); let wgpu_a = large_a.to_backend(tensor_frame::BackendType::Wgpu)?; let wgpu_b = large_b.to_backend(tensor_frame::BackendType::Wgpu)?; let conversion_time = start.elapsed(); let start = Instant::now(); let wgpu_result = &wgpu_a + &wgpu_b; let _sync = wgpu_result.to_vec()?; // Force synchronization let wgpu_time = start.elapsed(); println!("WGPU conversion time: {:?}", conversion_time); println!("WGPU computation time: {:?}", wgpu_time); println!("Total WGPU time: {:?}", conversion_time + wgpu_time); } Ok(()) } /// Advanced patterns demonstration fn advanced_patterns() -> Result<()> { println!("=== Advanced Patterns ==="); // Broadcasting example let matrix = Tensor::ones(vec![3, 4])?; // Shape: [3, 4] let vector = Tensor::ones(vec![4])?; // Shape: [4] let broadcasted = &matrix + &vector; // Result: [3, 4] println!("Matrix (3x4):\n{}\n", matrix); println!("Vector (4,):\n{}\n", vector); println!("Matrix + Vector (broadcasted):\n{}\n", broadcasted); // Complex broadcasting let a = Tensor::ones(vec![2, 1, 3])?; // Shape: [2, 1, 3] let b = Tensor::ones(vec![1, 4, 1])?; // Shape: [1, 4, 1] let complex_broadcast = &a + &b; // Result: [2, 4, 3] println!("Complex broadcasting:"); println!("A shape: {:?}", a.shape().dims()); println!("B shape: {:?}", b.shape().dims()); println!("Result shape: {:?}", complex_broadcast.shape().dims()); // Method chaining let result = Tensor::ones(vec![2, 3])? .reshape(vec![3, 2])? .transpose()?; println!("Method chaining result:\n{}\n", result); Ok(()) } /// Error handling examples fn error_handling_examples() -> Result<()> { println!("=== Error Handling ==="); // Shape mismatch error let a = Tensor::ones(vec![2, 3])?; let b = Tensor::ones(vec![3, 2])?; match &a + &b { Ok(result) => println!("Addition succeeded: {}", result), Err(e) => println!("Expected error - shape mismatch: {}", e), } // Invalid reshape error let tensor = Tensor::ones(vec![2, 3])?; // 6 elements match tensor.reshape(vec![2, 2]) { // 4 elements - invalid! 
Ok(result) => println!("Reshape succeeded: {}", result), Err(e) => println!("Expected error - invalid reshape: {}", e), } // Out of bounds dimension error match tensor.squeeze(Some(5)) { // Dimension 5 doesn't exist Ok(result) => println!("Squeeze succeeded: {}", result), Err(e) => println!("Expected error - invalid dimension: {}", e), } Ok(()) }
Key Concepts Demonstrated
1. Tensor Creation
Three primary ways to create tensors:
- Tensor::zeros(shape) - Creates a tensor filled with zeros
- Tensor::ones(shape) - Creates a tensor filled with ones
- Tensor::from_vec(data, shape) - Creates a tensor from existing data
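Together in one short snippet (shapes and values are arbitrary):

```rust
use tensor_frame::{Tensor, Result};

fn creation_demo() -> Result<()> {
    let zeros = Tensor::zeros(vec![2, 3])?;                             // all elements 0.0
    let ones = Tensor::ones(vec![2, 3])?;                               // all elements 1.0
    let data = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?; // takes ownership of the Vec
    println!("{} / {} / {} elements", zeros.numel(), ones.numel(), data.numel());
    Ok(())
}
```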
2. Reference vs. Owned Operations
#![allow(unused)] fn main() { // Moves tensors (can only use once) let result = a + b; // Uses references (can reuse tensors) let result = &a + &b; }
3. Shape Broadcasting
Tensor Frame automatically broadcasts compatible shapes:
#![allow(unused)] fn main() { let matrix = Tensor::ones(vec![3, 4])?; // [3, 4] let vector = Tensor::ones(vec![4])?; // [4] broadcasts to [1, 4] let result = matrix + vector; // Result: [3, 4] }
4. Method Chaining
Operations can be chained for concise code:
#![allow(unused)] fn main() { let result = tensor .reshape(vec![4, 2])? .transpose()? .squeeze(None)?; }
5. Error Handling
All operations return Result<T> for proper error handling:
#![allow(unused)] fn main() { match risky_operation() { Ok(tensor) => process_tensor(tensor), Err(TensorError::ShapeMismatch { expected, got }) => { eprintln!("Shape error: expected {:?}, got {:?}", expected, got); } Err(e) => eprintln!("Other error: {}", e), } }
Performance Tips
- Use References: Use &a + &b instead of a + b to avoid unnecessary clones
- Batch Operations: Combine operations when possible: (a * 2.0) + b vs separate operations
- Choose Right Backend: CPU for small tensors, GPU for large operations
- Avoid Frequent Conversions: Stay on one backend when possible
Common Pitfalls
- Shape Mismatches: Ensure compatible shapes for operations
- Invalid Reshapes: The new shape must have the same total number of elements (see the sketch after this list)
- Backend Overhead: GPU operations have overhead for small tensors
- Memory Usage: Large tensors consume significant memory
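The first two pitfalls can be caught up front with a small validation step. The sketch below is illustrative only (checked_reshape is not a library function); it assumes reshape takes &self, as in the earlier examples:

```rust
use tensor_frame::{Tensor, Result, TensorError, TensorOps};

/// Reshape only when the element counts agree; otherwise return a shape error.
fn checked_reshape(t: &Tensor, new_shape: Vec<usize>) -> Result<Tensor> {
    let new_numel: usize = new_shape.iter().product();
    if new_numel != t.numel() {
        // Reuse the library's shape-mismatch error instead of failing later
        return Err(TensorError::ShapeMismatch {
            expected: t.shape().dims().to_vec(),
            got: new_shape,
        });
    }
    t.reshape(new_shape)
}

fn main() -> Result<()> {
    let t = Tensor::ones(vec![2, 3])?;                 // 6 elements
    assert!(checked_reshape(&t, vec![3, 2]).is_ok());
    assert!(checked_reshape(&t, vec![2, 2]).is_err()); // 4 elements: rejected early
    Ok(())
}
```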
Next Steps
After mastering basic operations, explore:
- Broadcasting Examples - Advanced broadcasting patterns
- Backend Selection - Optimizing backend usage
- Performance Guide - Advanced performance optimization
Broadcasting Examples
Broadcasting is one of the most powerful features in Tensor Frame, allowing operations between tensors of different shapes. This guide provides comprehensive examples of broadcasting patterns and best practices.
Broadcasting Rules
Tensor Frame follows NumPy/PyTorch broadcasting rules; a small helper sketched after this list makes them concrete:
- Alignment: Shapes are compared element-wise from the trailing dimension
- Size 1 Expansion: Dimensions of size 1 are expanded to match
- Missing Dimensions: Missing leading dimensions are treated as size 1
- Compatibility: Dimensions must be either equal, or one must be 1
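These rules can be written down directly as a standalone shape helper. The function below is a sketch for illustration only, not part of the Tensor Frame API; it simply mirrors the four rules above:

```rust
/// Compute the broadcast result shape of two shapes, or None if incompatible.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let len = a.len().max(b.len());
    let mut out = Vec::with_capacity(len);
    for i in 0..len {
        // Align from the trailing dimension; missing leading dims count as 1
        let da = if i < a.len() { a[a.len() - 1 - i] } else { 1 };
        let db = if i < b.len() { b[b.len() - 1 - i] } else { 1 };
        match (da, db) {
            (x, y) if x == y => out.push(x),
            (1, y) => out.push(y),
            (x, 1) => out.push(x),
            _ => return None, // neither equal nor 1: incompatible
        }
    }
    out.reverse();
    Some(out)
}

fn main() {
    assert_eq!(broadcast_shape(&[2, 1], &[1, 3]), Some(vec![2, 3]));
    assert_eq!(broadcast_shape(&[3, 4], &[2, 4]), None);
    println!("broadcast rules check out");
}
```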
Basic Broadcasting Examples
Scalar Broadcasting
#![allow(unused)] fn main() { use tensor_frame::{Tensor, Result}; fn scalar_broadcasting() -> Result<()> { // Scalar broadcasts to all elements let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?; println!("Original tensor:\n{}\n", tensor); // Scalar addition let add_scalar = &tensor + 5.0; println!("Tensor + 5.0:\n{}\n", add_scalar); // Scalar multiplication let mul_scalar = &tensor * 2.0; println!("Tensor * 2.0:\n{}\n", mul_scalar); // Complex scalar operation let complex = (&tensor * 2.0) + 1.0; println!("(Tensor * 2.0) + 1.0:\n{}\n", complex); Ok(()) } }
Vector Broadcasting
#![allow(unused)] fn main() { fn vector_broadcasting() -> Result<()> { // Matrix-vector operations let matrix = Tensor::from_vec( vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0], vec![2, 3] )?; let vector = Tensor::from_vec(vec![10.0, 20.0, 30.0], vec![3])?; println!("Matrix (2x3):\n{}\n", matrix); println!("Vector (3,):\n{}\n", vector); // Vector broadcasts across matrix rows let result = &matrix + &vector; println!("Matrix + Vector:\n{}\n", result); // Row vector broadcasting let row_vector = Tensor::from_vec(vec![100.0, 200.0, 300.0], vec![1, 3])?; let row_result = &matrix + &row_vector; println!("Matrix + Row Vector (1x3):\n{}\n", row_result); // Column vector broadcasting let col_vector = Tensor::from_vec(vec![10.0, 20.0], vec![2, 1])?; let col_result = &matrix + &col_vector; println!("Matrix + Column Vector (2x1):\n{}\n", col_result); Ok(()) } }
Advanced Broadcasting Patterns
Multi-dimensional Broadcasting
#![allow(unused)] fn main() { fn multidimensional_broadcasting() -> Result<()> { // 3D tensor broadcasting let tensor_3d = Tensor::ones(vec![2, 3, 4])?; // Shape: [2, 3, 4] let tensor_2d = Tensor::ones(vec![3, 4])?; // Shape: [3, 4] let tensor_1d = Tensor::ones(vec![4])?; // Shape: [4] println!("3D tensor shape: {:?}", tensor_3d.shape().dims()); println!("2D tensor shape: {:?}", tensor_2d.shape().dims()); println!("1D tensor shape: {:?}", tensor_1d.shape().dims()); // 3D + 2D broadcasting: [2,3,4] + [3,4] -> [2,3,4] let result_3d_2d = &tensor_3d + &tensor_2d; println!("3D + 2D result shape: {:?}", result_3d_2d.shape().dims()); // 3D + 1D broadcasting: [2,3,4] + [4] -> [2,3,4] let result_3d_1d = &tensor_3d + &tensor_1d; println!("3D + 1D result shape: {:?}", result_3d_1d.shape().dims()); // Complex multi-dimensional broadcasting let a = Tensor::ones(vec![1, 3, 1])?; // Shape: [1, 3, 1] let b = Tensor::ones(vec![2, 1, 4])?; // Shape: [2, 1, 4] let complex_result = &a + &b; // Result: [2, 3, 4] println!("Complex broadcasting:"); println!(" A shape: {:?}", a.shape().dims()); println!(" B shape: {:?}", b.shape().dims()); println!(" Result shape: {:?}", complex_result.shape().dims()); Ok(()) } }
Broadcasting with Size-1 Dimensions
#![allow(unused)] fn main() { fn size_one_broadcasting() -> Result<()> { // Different ways to create broadcastable tensors let base = Tensor::from_vec( vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0], vec![2, 3] )?; // Row broadcasting (1 x N) let row_broadcast = Tensor::from_vec(vec![10.0, 20.0, 30.0], vec![1, 3])?; let row_result = &base + &row_broadcast; println!("Row broadcasting [2,3] + [1,3]:\n{}\n", row_result); // Column broadcasting (N x 1) let col_broadcast = Tensor::from_vec(vec![100.0, 200.0], vec![2, 1])?; let col_result = &base + &col_broadcast; println!("Column broadcasting [2,3] + [2,1]:\n{}\n", col_result); // Both dimensions broadcast (1 x 1) let scalar_as_tensor = Tensor::from_vec(vec![1000.0], vec![1, 1])?; let scalar_result = &base + &scalar_as_tensor; println!("Scalar broadcasting [2,3] + [1,1]:\n{}\n", scalar_result); Ok(()) } }
Broadcasting in Practice
Machine Learning Patterns
#![allow(unused)] fn main() { fn ml_broadcasting_patterns() -> Result<()> { // Batch normalization pattern let batch_data = Tensor::ones(vec![32, 128])?; // 32 samples, 128 features let mean = Tensor::zeros(vec![128])?; // Feature means let std = Tensor::ones(vec![128])?; // Feature standard deviations // Normalize: (x - mean) / std let normalized = (&batch_data - &mean) / &std; println!("Batch normalization result shape: {:?}", normalized.shape().dims()); // Bias addition pattern let linear_output = Tensor::ones(vec![32, 10])?; // Batch size 32, 10 classes let bias = Tensor::from_vec( vec![0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], vec![10] )?; let biased_output = &linear_output + &bias; println!("Bias addition result shape: {:?}", biased_output.shape().dims()); // Attention score broadcasting let queries = Tensor::ones(vec![32, 8, 64])?; // [batch, heads, dim] let attention_weights = Tensor::ones(vec![32, 8, 1])?; // [batch, heads, 1] let weighted_queries = &queries * &attention_weights; println!("Attention weighting result shape: {:?}", weighted_queries.shape().dims()); Ok(()) } }
Image Processing Patterns
#![allow(unused)] fn main() { fn image_broadcasting_patterns() -> Result<()> { // Image batch processing let images = Tensor::ones(vec![4, 3, 224, 224])?; // [batch, channels, height, width] // Channel-wise normalization let channel_mean = Tensor::from_vec( vec![0.485, 0.456, 0.406], // ImageNet means vec![1, 3, 1, 1] )?; let channel_std = Tensor::from_vec( vec![0.229, 0.224, 0.225], // ImageNet stds vec![1, 3, 1, 1] )?; let normalized_images = (&images - &channel_mean) / &channel_std; println!("Image normalization result shape: {:?}", normalized_images.shape().dims()); // Pixel-wise operations let brightness_adjustment = Tensor::from_vec(vec![0.1], vec![1, 1, 1, 1])?; let brightened = &images + &brightness_adjustment; println!("Brightness adjustment result shape: {:?}", brightened.shape().dims()); Ok(()) } }
Performance Considerations
Efficient Broadcasting
#![allow(unused)] fn main() { use std::time::Instant; fn broadcasting_performance() -> Result<()> { // Efficient: Broadcasting avoids large intermediate tensors let large_matrix = Tensor::ones(vec![1000, 1000])?; let small_vector = Tensor::ones(vec![1000])?; let start = Instant::now(); let efficient_result = &large_matrix + &small_vector; // Broadcasting let efficient_time = start.elapsed(); println!("Efficient broadcasting: {:?}", efficient_time); // Less efficient: Explicit expansion (don't do this!) let start = Instant::now(); let expanded_vector = small_vector.reshape(vec![1, 1000])?; // Note: This would need manual tiling which isn't implemented // let manual_result = &large_matrix + &expanded_vector; let manual_time = start.elapsed(); println!("Manual expansion overhead: {:?}", manual_time); Ok(()) } }
Memory-Efficient Patterns
#![allow(unused)] fn main() { fn memory_efficient_broadcasting() -> Result<()> { // Good: Broadcasting reuses memory let data = Tensor::ones(vec![1000, 500])?; let scale_factor = Tensor::from_vec(vec![2.0], vec![1])?; let scaled = &data * &scale_factor; // Memory efficient // Avoid: Creating large intermediate tensors // let large_scale = scale_factor.broadcast_to(vec![1000, 500])?; // Wasteful // let scaled = &data * &large_scale; println!("Memory-efficient scaling completed"); Ok(()) } }
Common Broadcasting Errors
Shape Incompatibility
#![allow(unused)] fn main() { fn broadcasting_errors() -> Result<()> { // These will fail - incompatible shapes let a = Tensor::ones(vec![3, 4])?; let b = Tensor::ones(vec![2, 4])?; // Different first dimension, not 1 match &a + &b { Ok(_) => println!("Unexpected success"), Err(e) => println!("Expected error - incompatible shapes: {}", e), } // These will work - compatible shapes let c = Tensor::ones(vec![1, 4])?; // First dimension is 1 let success = &a + &c; println!("Compatible shapes work: {:?}", success.shape().dims()); Ok(()) } }
Broadcasting Visualization
Understanding Shape Alignment
#![allow(unused)] fn main() { fn visualize_broadcasting() -> Result<()> { println!("Broadcasting visualization:"); println!(); // Example 1: [2, 3] + [3] println!("Example 1: [2, 3] + [3]"); println!(" A: [2, 3]"); println!(" B: [3] -> [1, 3] (implicit leading 1)"); println!(" Result: [2, 3]"); println!(); // Example 2: [4, 1, 5] + [3, 5] println!("Example 2: [4, 1, 5] + [3, 5]"); println!(" A: [4, 1, 5]"); println!(" B: [3, 5] -> [1, 3, 5] (implicit leading 1)"); println!(" Result: [4, 3, 5] (1 broadcasts to 3, 4)"); println!(); // Example 3: Incompatible println!("Example 3: [3, 4] + [2, 4] - INCOMPATIBLE"); println!(" A: [3, 4]"); println!(" B: [2, 4]"); println!(" Error: 3 and 2 cannot broadcast (neither is 1)"); println!(); Ok(()) } }
Best Practices
1. Design for Broadcasting
#![allow(unused)] fn main() { // Good: Design tensors with broadcasting in mind let batch_size = 32; let features = 128; let data = Tensor::ones(vec![batch_size, features])?; let weights = Tensor::ones(vec![features])?; // Broadcastable let bias = Tensor::ones(vec![features])?; // Broadcastable let output = (&data * &weights) + &bias; // Clean broadcasting }
2. Use Explicit Shapes
#![allow(unused)] fn main() { // Better: Be explicit about intended broadcasting let matrix = Tensor::ones(vec![10, 20])?; let row_vector = Tensor::ones(vec![1, 20])?; // Explicit [1, 20] let col_vector = Tensor::ones(vec![10, 1])?; // Explicit [10, 1] let row_broadcast = &matrix + &row_vector; let col_broadcast = &matrix + &col_vector; }
3. Document Broadcasting Intent
#![allow(unused)] fn main() { /// Applies per-channel normalization to image batch /// /// # Arguments /// * `images` - Shape [batch, channels, height, width] /// * `channel_stats` - Shape [1, channels, 1, 1] for broadcasting fn normalize_images(images: &Tensor, channel_stats: &Tensor) -> Result<Tensor> { // Broadcasting: [B,C,H,W] - [1,C,1,1] -> [B,C,H,W] images - channel_stats } }
4. Validate Shapes Early
#![allow(unused)] fn main() { fn safe_broadcast_operation(a: &Tensor, b: &Tensor) -> Result<Tensor> { // Check compatibility before expensive operations let a_shape = a.shape().dims(); let b_shape = b.shape().dims(); // Custom validation logic here if !shapes_are_broadcastable(a_shape, b_shape) { return Err(TensorError::ShapeMismatch { expected: a_shape.to_vec(), got: b_shape.to_vec(), }); } // Proceed with operation a + b } fn shapes_are_broadcastable(a: &[usize], b: &[usize]) -> bool { let max_len = a.len().max(b.len()); for i in 0..max_len { // Compare from the trailing dimension; missing leading dims count as 1 let a_dim = if i < a.len() { a[a.len() - 1 - i] } else { 1 }; let b_dim = if i < b.len() { b[b.len() - 1 - i] } else { 1 }; if a_dim != b_dim && a_dim != 1 && b_dim != 1 { return false; } } true } }
Next Steps
After mastering broadcasting:
- Custom Backends - Optimize broadcasting for different backends
- Performance Guide - Advanced broadcasting optimization
- API Reference - Detailed operation specifications
Custom Backend Examples
This guide demonstrates how to effectively use different computational backends in Tensor Frame, including when to switch backends, performance optimization strategies, and mixed backend workflows.
Backend Selection Strategies
Automatic vs Manual Selection
#![allow(unused)] fn main() { use tensor_frame::{Tensor, BackendType, Result}; use std::time::Instant; fn backend_selection_demo() -> Result<()> { println!("=== Backend Selection Strategies ===\n"); // Automatic selection (recommended for most cases) let auto_tensor = Tensor::zeros(vec![1000, 1000])?; println!("Automatic backend selected: {:?}", auto_tensor.backend_type()); // Manual backend specification let cpu_tensor = auto_tensor.to_backend(BackendType::Cpu)?; println!("Forced CPU backend: {:?}", cpu_tensor.backend_type()); #[cfg(feature = "wgpu")] { match auto_tensor.to_backend(BackendType::Wgpu) { Ok(wgpu_tensor) => { println!("WGPU backend available: {:?}", wgpu_tensor.backend_type()); } Err(e) => { println!("WGPU backend not available: {}", e); } } } #[cfg(feature = "cuda")] { match auto_tensor.to_backend(BackendType::Cuda) { Ok(cuda_tensor) => { println!("CUDA backend available: {:?}", cuda_tensor.backend_type()); } Err(e) => { println!("CUDA backend not available: {}", e); } } } Ok(()) } }
Size-Based Backend Selection
#![allow(unused)] fn main() { fn adaptive_backend_selection() -> Result<()> { println!("=== Adaptive Backend Selection ===\n"); let sizes = vec![ (vec![10, 10], "tiny"), (vec![100, 100], "small"), (vec![1000, 1000], "medium"), (vec![3000, 3000], "large"), ]; for (shape, description) in sizes { let elements = shape.iter().product::<usize>(); // Choose backend based on tensor size let backend = if elements < 1000 { BackendType::Cpu // CPU overhead minimal for small tensors } else if elements < 1_000_000 { // Try WGPU first, fallback to CPU #[cfg(feature = "wgpu")] { BackendType::Wgpu } #[cfg(not(feature = "wgpu"))] { BackendType::Cpu } } else { // Large tensors: prefer CUDA > WGPU > CPU #[cfg(feature = "cuda")] { BackendType::Cuda } #[cfg(all(feature = "wgpu", not(feature = "cuda")))] { BackendType::Wgpu } #[cfg(all(not(feature = "wgpu"), not(feature = "cuda")))] { BackendType::Cpu } }; let tensor = Tensor::zeros(shape.clone())?; let optimized_tensor = tensor.to_backend(backend)?; println!("{} tensor {:?}: {} elements -> {:?} backend", description, shape, elements, optimized_tensor.backend_type()); } Ok(()) } }
Performance Benchmarking
Backend Performance Comparison
#![allow(unused)] fn main() { fn benchmark_backends() -> Result<()> { println!("=== Backend Performance Comparison ===\n"); let sizes = vec![ vec![100, 100], vec![500, 500], vec![1000, 1000], vec![2000, 2000], ]; for size in sizes { println!("Benchmarking {}x{} matrix addition:", size[0], size[1]); // Create test tensors let a = Tensor::ones(size.clone())?; let b = Tensor::ones(size.clone())?; // CPU benchmark let cpu_a = a.to_backend(BackendType::Cpu)?; let cpu_b = b.to_backend(BackendType::Cpu)?; let start = Instant::now(); let cpu_result = &cpu_a + &cpu_b; let cpu_time = start.elapsed(); println!(" CPU: {:?}", cpu_time); // WGPU benchmark (if available) #[cfg(feature = "wgpu")] { match (a.to_backend(BackendType::Wgpu), b.to_backend(BackendType::Wgpu)) { (Ok(wgpu_a), Ok(wgpu_b)) => { let start = Instant::now(); let wgpu_result = &wgpu_a + &wgpu_b; // Force synchronization by converting back let _sync = wgpu_result.to_vec()?; let wgpu_time = start.elapsed(); let speedup = cpu_time.as_nanos() as f64 / wgpu_time.as_nanos() as f64; println!(" WGPU: {:?} ({}x speedup)", wgpu_time, speedup); } _ => println!(" WGPU: Not available"), } } // CUDA benchmark (if available) #[cfg(feature = "cuda")] { match (a.to_backend(BackendType::Cuda), b.to_backend(BackendType::Cuda)) { (Ok(cuda_a), Ok(cuda_b)) => { let start = Instant::now(); let cuda_result = &cuda_a + &cuda_b; let _sync = cuda_result.to_vec()?; let cuda_time = start.elapsed(); let speedup = cpu_time.as_nanos() as f64 / cuda_time.as_nanos() as f64; println!(" CUDA: {:?} ({}x speedup)", cuda_time, speedup); } _ => println!(" CUDA: Not available"), } } println!(); } Ok(()) } }
Operation-Specific Benchmarks
#![allow(unused)] fn main() { fn operation_benchmarks() -> Result<()> { println!("=== Operation-Specific Benchmarks ===\n"); let size = vec![1000, 1000]; let a = Tensor::ones(size.clone())?; let b = Tensor::ones(size.clone())?; // Test different operations let operations = vec![ ("Addition", |a: &Tensor, b: &Tensor| a + b), ("Multiplication", |a: &Tensor, b: &Tensor| a * b), ("Complex", |a: &Tensor, b: &Tensor| (a * 2.0) + b), ]; for (op_name, operation) in operations { println!("Operation: {}", op_name); // CPU timing let cpu_a = a.to_backend(BackendType::Cpu)?; let cpu_b = b.to_backend(BackendType::Cpu)?; let start = Instant::now(); let _cpu_result = operation(&cpu_a, &cpu_b)?; let cpu_time = start.elapsed(); println!(" CPU: {:?}", cpu_time); // GPU timing (if available) #[cfg(feature = "wgpu")] { if let (Ok(gpu_a), Ok(gpu_b)) = ( a.to_backend(BackendType::Wgpu), b.to_backend(BackendType::Wgpu) ) { let start = Instant::now(); let gpu_result = operation(&gpu_a, &gpu_b)?; let _sync = gpu_result.to_vec()?; // Force sync let gpu_time = start.elapsed(); let speedup = cpu_time.as_nanos() as f64 / gpu_time.as_nanos() as f64; println!(" GPU: {:?} ({}x speedup)", gpu_time, speedup); } } println!(); } Ok(()) } }
Mixed Backend Workflows
Pipeline with Backend Transitions
#![allow(unused)] fn main() { fn mixed_backend_pipeline() -> Result<()> { println!("=== Mixed Backend Pipeline ===\n"); // Stage 1: Data preparation on CPU (I/O intensive) println!("Stage 1: Data preparation on CPU"); let raw_data = vec![1.0; 1_000_000]; // Simulate data loading let cpu_tensor = Tensor::from_vec(raw_data, vec![1000, 1000])?; println!(" Created tensor on CPU: {:?}", cpu_tensor.backend_type()); // Stage 2: Heavy computation on GPU #[cfg(feature = "wgpu")] { println!("Stage 2: Moving to GPU for computation"); let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?; println!(" Moved to GPU: {:?}", gpu_tensor.backend_type()); // Perform heavy computations on GPU let processed = (&gpu_tensor * 2.0) + 1.0; let normalized = &processed / processed.sum(None)?; println!(" Completed GPU computations"); // Stage 3: Results back to CPU for output println!("Stage 3: Moving results back to CPU"); let final_result = normalized.to_backend(BackendType::Cpu)?; println!(" Final result on CPU: {:?}", final_result.backend_type()); // Stage 4: Extract specific values (CPU efficient) let summary = final_result.sum(None)?; println!(" Summary value: {}", summary.to_vec()?[0]); } #[cfg(not(feature = "wgpu"))] { println!("Stage 2-4: Processing on CPU (GPU not available)"); let processed = (&cpu_tensor * 2.0) + 1.0; let summary = processed.sum(None)?; println!(" Summary value: {}", summary.to_vec()?[0]); } Ok(()) } }
Batch Processing Strategy
#![allow(unused)] fn main() { fn batch_processing_strategy() -> Result<()> { println!("=== Batch Processing Strategy ===\n"); // Simulate multiple data batches let batch_sizes = vec![100, 500, 1000, 2000]; for batch_size in batch_sizes { println!("Processing batch size: {}", batch_size); // Create multiple tensors (simulating data batches) let batches: Result<Vec<_>> = (0..5) .map(|i| { let data = vec![i as f32; batch_size * batch_size]; Tensor::from_vec(data, vec![batch_size, batch_size]) }) .collect(); let batches = batches?; // Choose optimal backend based on batch size let backend = if batch_size < 500 { BackendType::Cpu } else { #[cfg(feature = "wgpu")] { BackendType::Wgpu } #[cfg(not(feature = "wgpu"))] { BackendType::Cpu } }; let start = Instant::now(); // Convert all batches to optimal backend let gpu_batches: Result<Vec<_>> = batches .into_iter() .map(|batch| batch.to_backend(backend)) .collect(); let gpu_batches = gpu_batches?; // Process all batches let results: Result<Vec<_>> = gpu_batches .iter() .map(|batch| batch.sum(None)) .collect(); let results = results?; let processing_time = start.elapsed(); println!(" Backend: {:?}", backend); println!(" Processing time: {:?}", processing_time); println!(" Results count: {}", results.len()); println!(); } Ok(()) } }
Error Handling and Fallback Strategies
Robust Backend Selection
#![allow(unused)] fn main() { fn robust_backend_selection(tensor: Tensor) -> Result<Tensor> { // Try backends in order of preference let backends_to_try = vec![ #[cfg(feature = "cuda")] BackendType::Cuda, #[cfg(feature = "wgpu")] BackendType::Wgpu, BackendType::Cpu, ]; for backend in backends_to_try { match tensor.to_backend(backend) { Ok(converted_tensor) => { println!("Successfully using backend: {:?}", backend); return Ok(converted_tensor); } Err(e) => { println!("Backend {:?} failed: {}", backend, e); continue; } } } // This should never happen since CPU should always work Err(tensor_frame::TensorError::BackendError( "No backend available".to_string() )) } fn robust_operation_with_fallback() -> Result<()> { println!("=== Robust Operation with Fallback ===\n"); let large_tensor = Tensor::ones(vec![2000, 2000])?; // Try GPU operation first let result = match large_tensor.to_backend(BackendType::Wgpu) { Ok(gpu_tensor) => { match gpu_tensor.sum(None) { Ok(result) => { println!("GPU operation successful"); result } Err(e) => { println!("GPU operation failed: {}, falling back to CPU", e); large_tensor.to_backend(BackendType::Cpu)?.sum(None)? } } } Err(e) => { println!("GPU conversion failed: {}, using CPU", e); large_tensor.sum(None)? } }; println!("Final result: {}", result.to_vec()?[0]); Ok(()) } }
Memory Management Across Backends
#![allow(unused)] fn main() { fn memory_management_demo() -> Result<()> { println!("=== Memory Management Across Backends ===\n"); // Monitor memory usage pattern let tensor_size = vec![1000, 1000]; // 4MB tensor // Start with CPU let cpu_tensor = Tensor::ones(tensor_size.clone())?; println!("Created tensor on CPU"); // Convert to GPU (allocates GPU memory) #[cfg(feature = "wgpu")] { let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?; println!("Converted to GPU (both CPU and GPU memory used)"); // Process on GPU let gpu_result = (&gpu_tensor * 2.0) + 1.0; println!("Processed on GPU"); // Convert back to CPU (allocates new CPU memory) let final_result = gpu_result.to_backend(BackendType::Cpu)?; println!("Converted back to CPU"); // At this point: original CPU tensor, GPU tensor, and final CPU tensor exist // Memory is automatically freed when variables go out of scope let summary = final_result.sum(None)?; println!("Final summary: {}", summary.to_vec()?[0]); } println!("Memory automatically freed when variables go out of scope"); Ok(()) } }
Production Patterns
Configuration-Driven Backend Selection
#![allow(unused)] fn main() { use std::env; #[derive(Debug)] struct TensorConfig { preferred_backend: BackendType, fallback_backends: Vec<BackendType>, small_tensor_threshold: usize, } impl TensorConfig { fn from_env() -> Self { let preferred = env::var("TENSOR_BACKEND") .unwrap_or_else(|_| "auto".to_string()); let preferred_backend = match preferred.as_str() { "cpu" => BackendType::Cpu, #[cfg(feature = "wgpu")] "wgpu" => BackendType::Wgpu, #[cfg(feature = "cuda")] "cuda" => BackendType::Cuda, _ => { // Auto-select best available #[cfg(feature = "cuda")] { BackendType::Cuda } #[cfg(all(feature = "wgpu", not(feature = "cuda")))] { BackendType::Wgpu } #[cfg(all(not(feature = "wgpu"), not(feature = "cuda")))] { BackendType::Cpu } } }; let threshold = env::var("SMALL_TENSOR_THRESHOLD") .unwrap_or_else(|_| "10000".to_string()) .parse() .unwrap_or(10000); TensorConfig { preferred_backend, fallback_backends: vec![BackendType::Cpu], // Always fallback to CPU small_tensor_threshold: threshold, } } fn select_backend(&self, tensor_size: usize) -> BackendType { if tensor_size < self.small_tensor_threshold { BackendType::Cpu // Always use CPU for small tensors } else { self.preferred_backend } } } fn production_backend_usage() -> Result<()> { println!("=== Production Backend Usage ===\n"); let config = TensorConfig::from_env(); println!("Configuration: {:?}", config); // Use configuration for tensor operations let sizes = vec![100, 1000, 10000, 100000]; for size in sizes { let tensor = Tensor::ones(vec![size])?; let elements = tensor.numel(); let backend = config.select_backend(elements); let optimized_tensor = tensor.to_backend(backend)?; println!("Tensor size {}: using {:?} backend", elements, optimized_tensor.backend_type()); } Ok(()) } }
Application-Level Backend Strategy
#![allow(unused)] fn main() { struct TensorApplication { config: TensorConfig, } impl TensorApplication { fn new() -> Self { Self { config: TensorConfig::from_env(), } } fn process_data(&self, data: Vec<f32>, shape: Vec<usize>) -> Result<Tensor> { // Create tensor let tensor = Tensor::from_vec(data, shape)?; // Select optimal backend let backend = self.config.select_backend(tensor.numel()); let optimized_tensor = tensor.to_backend(backend)?; // Perform operations let processed = (&optimized_tensor * 2.0) + 1.0; let normalized = &processed / processed.sum(None)?; Ok(normalized) } fn batch_process(&self, batches: Vec<Vec<f32>>, shape: Vec<usize>) -> Result<Vec<Tensor>> { batches .into_iter() .map(|batch| self.process_data(batch, shape.clone())) .collect() } } }
Best Practices Summary
1. Size-Based Selection
- Small tensors (< 10K elements): Use CPU backend
- Medium tensors (10K - 1M elements): Consider WGPU
- Large tensors (> 1M elements): Prefer CUDA > WGPU > CPU
2. Operation-Based Selection
- I/O operations: Use CPU backend
- Element-wise operations: Use GPU backends for large tensors
- Reductions: GPU effective for very large tensors
- Large reductions: CUDA > CPU > WGPU (until WGPU reductions are implemented)
3. Memory Management
- Convert to target backend early in pipeline
- Avoid frequent backend conversions
- Use batch processing when possible
- Monitor memory usage in production
4. Error Handling
- Always provide CPU fallback
- Handle backend-specific errors gracefully
- Use configuration for backend preferences
- Test with all available backends
5. Performance Optimization
- Benchmark with your specific workload
- Consider warmup time for GPU backends (see the sketch below)
- Profile memory transfer overhead
- Use appropriate tensor sizes for each backend
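For the warmup point in particular, a common pattern is to run one small throwaway operation on the GPU backend before timing, so device and pipeline setup costs are not attributed to the measured kernel. This is a sketch, assuming the wgpu feature is enabled and that, as in the earlier examples, arithmetic returns a Result and to_vec forces synchronization:

```rust
use std::time::Instant;
use tensor_frame::{Tensor, BackendType, Result};

fn warmed_up_gpu_benchmark() -> Result<()> {
    // Warmup: a tiny throwaway op pays device/pipeline setup costs up front
    let warm = Tensor::ones(vec![16, 16])?.to_backend(BackendType::Wgpu)?;
    let _ = (&warm + &warm)?.to_vec()?;

    // Timed run: setup overhead is no longer attributed to the kernel
    let a = Tensor::ones(vec![2048, 2048])?.to_backend(BackendType::Wgpu)?;
    let b = Tensor::ones(vec![2048, 2048])?.to_backend(BackendType::Wgpu)?;
    let start = Instant::now();
    let sum = (&a + &b)?;
    let _ = sum.to_vec()?; // synchronize before reading the clock
    println!("warmed-up GPU add: {:?}", start.elapsed());
    Ok(())
}
```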
Next Steps
- Performance Guide - Advanced optimization techniques
- API Reference - Detailed backend API documentation
- Backend-Specific Guides - Deep dives into each backend
Performance Guide
This guide provides detailed information on optimizing Tensor Frame performance across different backends and use cases.
Performance Overview
Tensor Frame's performance characteristics vary significantly based on:
- Tensor size: Small vs large tensors have different optimal backends
- Operation type: Element-wise vs reductions vs matrix operations
- Backend selection: CPU vs WGPU vs CUDA performance profiles
- Memory patterns: Data locality and transfer overhead
Backend Performance Characteristics
CPU Backend
- Best for: Small tensors (< 10K elements), development, guaranteed availability
- Strengths: Low latency, no setup overhead, excellent debugging
- Limitations: Limited parallelism, memory bandwidth bound for large operations
#![allow(unused)] fn main() { use tensor_frame::Tensor; // CPU optimal: Small tensors and scalar operations let small = Tensor::ones(vec![100, 100])?; let result = small.sum(None)?; // ~0.1ms on modern CPU }
WGPU Backend
- Best for: Large element-wise operations (> 100K elements), cross-platform deployment
- Strengths: Massive parallelism, good memory bandwidth, portable
- Limitations: GPU setup overhead (~1-10ms), limited operation support
#![allow(unused)] fn main() { use tensor_frame::{Tensor, BackendType}; // WGPU optimal: Large parallel operations let large_a = Tensor::ones(vec![2048, 2048])?.to_backend(BackendType::Wgpu)?; let large_b = Tensor::ones(vec![2048, 2048])?.to_backend(BackendType::Wgpu)?; let large_c = Tensor::ones(vec![2048, 2048])?.to_backend(BackendType::Wgpu)?; let result = (large_a * large_b) + large_c; // ~2ms on modern GPU }
CUDA Backend
- Best for: Very large operations (> 1M elements), production workloads
- Strengths: Peak performance, mature optimizations, cuBLAS integration
- Limitations: NVIDIA-only, CUDA toolkit requirement
#![allow(unused)] fn main() { use tensor_frame::{Tensor, BackendType}; // CUDA optimal: Matrix operations and very large tensors let matrix_a = Tensor::ones(vec![4096, 4096])?.to_backend(BackendType::Cuda)?; let matrix_b = Tensor::ones(vec![4096, 4096])?.to_backend(BackendType::Cuda)?; let result = matrix_a.matmul(&matrix_b)?; // ~15ms with cuBLAS }
Operation-Specific Performance
Element-wise Operations
Performance Scaling:
- CPU: O(n) with thread-level parallelism (8-32 threads)
- WGPU: O(n) with massive parallelism (1000+ threads)
- CUDA: O(n) with optimal parallelism (10000+ threads)
#![allow(unused)] fn main() { use std::time::Instant; fn benchmark_element_wise() -> Result<()> { let sizes = vec![1000, 5000, 10000, 50000]; for size in sizes { let a = Tensor::ones(vec![size, size])?; let b = Tensor::ones(vec![size, size])?; // CPU timing let start = Instant::now(); let cpu_result = &a + &b; let cpu_time = start.elapsed(); // GPU timing (if available) #[cfg(feature = "wgpu")] { let gpu_a = a.to_backend(BackendType::Wgpu)?; let gpu_b = b.to_backend(BackendType::Wgpu)?; let start = Instant::now(); let gpu_result = &gpu_a + &gpu_b; let _sync = gpu_result.to_vec()?; let gpu_time = start.elapsed(); let speedup = cpu_time.as_nanos() as f64 / gpu_time.as_nanos() as f64; println!("Size {}x{}: CPU {:?}, GPU {:?}, Speedup: {:.1}x", size, size, cpu_time, gpu_time, speedup); } } Ok(()) } }
Reduction Operations
Performance Notes:
- CPU: Rayon parallel reduction, cache-efficient
- GPU: Requires multiple kernel launches for large reductions
- Memory-bound for large tensors
#![allow(unused)] fn main() { fn reduction_performance() -> Result<()> { let tensor = Tensor::ones(vec![10000, 10000])?; // 100M elements // Sum reduction timing let start = Instant::now(); let sum = tensor.sum(None)?; let cpu_time = start.elapsed(); println!("CPU sum reduction (100M elements): {:?}", cpu_time); println!("Result: {}", sum.to_vec()?[0]); Ok(()) } }
Memory Performance
Memory Transfer Costs
GPU operations include memory transfer overhead:
#![allow(unused)] fn main() { fn memory_transfer_analysis() -> Result<()> { let sizes = vec![1000, 5000, 10000]; for size in sizes { let tensor = Tensor::ones(vec![size, size])?; let elements = tensor.numel(); let bytes = elements * 4; // f32 = 4 bytes #[cfg(feature = "wgpu")] { // Time conversion to GPU let start = Instant::now(); let gpu_tensor = tensor.to_backend(BackendType::Wgpu)?; let upload_time = start.elapsed(); // Time conversion back to CPU let start = Instant::now(); let _data = gpu_tensor.to_vec()?; let download_time = start.elapsed(); let upload_bw = bytes as f64 / upload_time.as_secs_f64() / 1e9; // GB/s let download_bw = bytes as f64 / download_time.as_secs_f64() / 1e9; // GB/s println!("Size {}x{} ({} MB):", size, size, bytes / 1024 / 1024); println!(" Upload: {:?} ({:.1} GB/s)", upload_time, upload_bw); println!(" Download: {:?} ({:.1} GB/s)", download_time, download_bw); } } Ok(()) } }
Memory Layout Optimization
#![allow(unused)] fn main() { // Efficient: Contiguous memory access let matrix = Tensor::from_vec(data, vec![rows, cols])?; let transposed = matrix.transpose()?; // May require memory copy // Efficient: Operations that preserve layout let result = (&matrix_a + &matrix_b) * 2.0; // All operations maintain layout // Less efficient: Operations that break layout let reshaped = matrix.reshape(vec![cols, rows])?; // May require copy }
Optimization Strategies
1. Backend Selection Strategy
#![allow(unused)] fn main() { fn optimal_backend_for_workload(tensor_size: usize, operation: &str) -> BackendType { match (tensor_size, operation) { // Small tensors: CPU always optimal (0..=10_000, _) => BackendType::Cpu, // Large reductions: Prefer CUDA (_, "reduction") if tensor_size > 1_000_000 => { #[cfg(feature = "cuda")] { BackendType::Cuda } #[cfg(not(feature = "cuda"))] { BackendType::Cpu } } // Large element-wise: GPU beneficial (10_001..=1_000_000, "elementwise") => { #[cfg(feature = "wgpu")] { BackendType::Wgpu } #[cfg(not(feature = "wgpu"))] { BackendType::Cpu } } // Very large: Prefer CUDA > WGPU > CPU (1_000_001.., _) => { #[cfg(feature = "cuda")] { BackendType::Cuda } #[cfg(all(feature = "wgpu", not(feature = "cuda")))] { BackendType::Wgpu } #[cfg(all(not(feature = "wgpu"), not(feature = "cuda")))] { BackendType::Cpu } } // Default: CPU _ => BackendType::Cpu, } } }
2. Operation Fusion
#![allow(unused)] fn main() { // Efficient: Fused operations let result = ((a * b) + c) / d; // Single expression, potential fusion // Less efficient: Separate operations let temp1 = a * b; let temp2 = temp1 + c; let result = temp2 / d; // Multiple temporary allocations }
3. Batch Processing
#![allow(unused)] fn main() { fn efficient_batch_processing(batches: Vec<Tensor>) -> Result<Vec<Tensor>> { // Convert all to same backend once let backend = BackendType::Wgpu; let gpu_batches: Result<Vec<_>> = batches .into_iter() .map(|t| t.to_backend(backend)) .collect(); // Process on GPU gpu_batches? .into_iter() .map(|batch| { // Heavy computation on GPU (batch * 2.0) + 1.0 }) .collect() } }
4. Memory Pool Usage
#![allow(unused)] fn main() { use std::collections::HashMap; // Efficient: Reuse similar-sized tensors struct TensorPool { cached_tensors: HashMap<Vec<usize>, Vec<Tensor>>, } impl TensorPool { fn get_or_create(&mut self, shape: Vec<usize>) -> Result<Tensor> { if let Some(cached) = self.cached_tensors.get_mut(&shape) { if let Some(tensor) = cached.pop() { return Ok(tensor); } } // Create new tensor if no cached version Tensor::zeros(shape) } fn return_tensor(&mut self, tensor: Tensor) { let shape = tensor.shape().dims().to_vec(); self.cached_tensors .entry(shape) .or_insert_with(Vec::new) .push(tensor); } } }
Profiling and Debugging
CPU Profiling
#![allow(unused)] fn main() { // Use built-in timing use std::time::Instant; let start = Instant::now(); let result = expensive_operation()?; println!("Operation took: {:?}", start.elapsed()); // Use external profilers // cargo install flamegraph // cargo flamegraph --bin your_app }
GPU Profiling
NVIDIA Tools (for CUDA backend):
# Nsight Systems for timeline analysis
nsys profile --stats=true ./your_app
# Nsight Compute for kernel analysis
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed ./your_app
Platform Tools (for WGPU backend):
- Windows: PIX for Windows, RenderDoc
- macOS: Xcode Instruments (GPU Timeline)
- Linux: RenderDoc, Vulkan Tools
Memory Profiling
#![allow(unused)] fn main() { fn memory_usage_analysis() -> Result<()> { use std::alloc::{GlobalAlloc, Layout, System}; // Monitor system memory usage #[cfg(target_os = "linux")] { use std::fs; let status = fs::read_to_string("/proc/self/status")?; for line in status.lines() { if line.starts_with("VmRSS:") { println!("Memory usage: {}", line); } } } // GPU memory monitoring (platform-specific) #[cfg(feature = "cuda")] { // CUDA memory info let (free, total) = cuda::memory_info()?; println!("GPU memory: {} MB free of {} MB total", free / 1024 / 1024, total / 1024 / 1024); } Ok(()) } }
Performance Benchmarking
Comprehensive Benchmark Suite
#![allow(unused)] fn main() { use criterion::{criterion_group, criterion_main, Criterion}; fn bench_tensor_operations(c: &mut Criterion) { let sizes = vec![100, 500, 1000, 2000]; for size in sizes { let a = Tensor::ones(vec![size, size]).unwrap(); let b = Tensor::ones(vec![size, size]).unwrap(); // CPU benchmark c.bench_function(&format!("cpu_add_{}x{}", size, size), |bench| { bench.iter(|| { let _result = &a + &b; }); }); // GPU benchmark (if available) #[cfg(feature = "wgpu")] { let gpu_a = a.to_backend(BackendType::Wgpu).unwrap(); let gpu_b = b.to_backend(BackendType::Wgpu).unwrap(); c.bench_function(&format!("gpu_add_{}x{}", size, size), |bench| { bench.iter(|| { let result = &gpu_a + &gpu_b; let _sync = result.to_vec().unwrap(); // Force sync }); }); } } } criterion_group!(benches, bench_tensor_operations); criterion_main!(benches); }
Performance Troubleshooting
Common Performance Issues
- Small Tensors on GPU
#![allow(unused)] fn main() { // Problem: GPU overhead for small operations let small = Tensor::ones(vec![10, 10])?; let slow = small.to_backend(BackendType::Wgpu)?; // Overhead > computation // Solution: Use CPU for small tensors let fast = small; // Stay on CPU }
- Frequent Backend Conversions
#![allow(unused)] fn main() { // Problem: Repeated conversions for i in 0..1000 { let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?; let result = gpu_tensor + 1.0; let back_to_cpu = result.to_backend(BackendType::Cpu)?; } // Solution: Convert once let mut gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?; for i in 0..1000 { gpu_tensor = gpu_tensor + 1.0; // Stay on GPU } let final_result = gpu_tensor.to_backend(BackendType::Cpu)?; }
- Memory Fragmentation
#![allow(unused)] fn main() { // Problem: Large temporary allocations let huge_temp = (huge_a * huge_b) + huge_c; // 3 large tensors in memory // Solution: In-place operations (when available) let result = huge_a.mul_add(&huge_b, &huge_c)?; // Hypothetical in-place op }
Performance Debugging Checklist
- Profile first: Measure before optimizing (a minimal timing helper is sketched after this checklist)
- Check backend selection: Ensure optimal backend for workload
- Monitor memory transfers: GPU transfer costs often dominate
- Verify operation fusion: Combine operations when possible
- Consider batch size: Larger batches amortize overhead
- Test different tensor sizes: Performance characteristics vary by size
- Use appropriate data types: f32 vs f64 performance difference
- Monitor memory usage: Avoid memory pressure and swapping
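As a concrete starting point for the first item, a small reusable timing helper keeps measurements comparable across backends. This is a sketch (time_op is a hypothetical helper, not part of the library); it follows this guide's convention of calling to_vec to force synchronization before stopping the clock:

```rust
use std::time::Instant;
use tensor_frame::{Tensor, Result};

/// Run an operation, force the result back to the host, and report wall time.
fn time_op<F>(label: &str, op: F) -> Result<Tensor>
where
    F: FnOnce() -> Result<Tensor>,
{
    let start = Instant::now();
    let out = op()?;
    let _ = out.to_vec()?; // synchronize GPU backends before reading the clock
    println!("{}: {:?}", label, start.elapsed());
    Ok(out)
}

fn main() -> Result<()> {
    let a = Tensor::ones(vec![1024, 1024])?;
    let b = Tensor::ones(vec![1024, 1024])?;
    let _sum = time_op("add 1024x1024", || (a + b))?;
    Ok(())
}
```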
Hardware-Specific Optimization
CPU Optimization
- Use all available cores (Rayon handles this automatically)
- Ensure sufficient memory bandwidth
- Consider NUMA topology for large systems
- Link with optimized BLAS (OpenBLAS, Intel MKL)
GPU Optimization
- Ensure sufficient GPU memory
- Consider tensor sizes that align with GPU architecture
- Use appropriate batch sizes for GPU utilization
- Monitor thermal throttling on mobile/laptop GPUs
Memory Hierarchy
- L1/L2 cache: Small frequently-accessed tensors
- System RAM: Medium tensors and CPU operations
- GPU VRAM: Large tensors for GPU operations
- Storage: Streaming large datasets
Conclusion
Tensor Frame performance optimization requires understanding:
- Workload characteristics: Size, operations, access patterns
- Backend strengths: CPU for small/mixed, GPU for large parallel
- Memory costs: Transfer overhead, allocation patterns
- Platform specifics: Hardware capabilities and limitations
Use profiling tools to guide optimization decisions and always measure performance improvements to ensure they provide real benefits for your specific use case.
Contributing to Tensor Frame
We welcome contributions to Tensor Frame! This guide will help you get started with contributing to the project.
Getting Started
Development Setup
- Clone the repository:
git clone https://github.com/TrainPioneers/Tensor-Frame.git
cd Tensor-Frame
- Install Rust (if not already installed):
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env
- Install development dependencies:
# For documentation building
cargo install mdbook
# For benchmarking
cargo install criterion
# For code formatting
rustup component add rustfmt
# For linting
rustup component add clippy
- Build and test:
# Build with all features
cargo build --all-features
# Run tests
cargo test
# Run with specific backend
cargo test --features wgpu
cargo test --features cuda
Development Workflow
Building the Project
# Quick compilation check
cargo check
# Build with specific backends
cargo build --features wgpu
cargo build --features cuda
cargo build --all-features
# Release build
cargo build --release --all-features
Running Tests
# Run all tests
cargo test
# Test specific backend
make test-wgpu
make test-cuda
# Test with verbose output
cargo test -- --nocapture
# Run specific test
cargo test test_tensor_creation
Code Formatting and Linting
# Format code
cargo fmt
# Check formatting
cargo fmt --check
# Run clippy lints
cargo clippy
# Run clippy with all features
cargo clippy --all-features
# Fix clippy warnings
cargo clippy --fix
Documentation
# Generate API documentation
cargo doc --open
# Build the book
cd docs
mdbook build
# Serve book locally
mdbook serve
Contribution Guidelines
Code Style
- Formatting: Use cargo fmt for consistent formatting
- Linting: Address all cargo clippy warnings
- Naming: Use descriptive names following Rust conventions
- Comments: Document public APIs and complex algorithms
- Error Handling: Use proper Result types and meaningful error messages
Testing
All contributions must include appropriate tests:
#![allow(unused)] fn main() { #[cfg(test)] mod tests { use super::*; #[test] fn test_new_feature() { let tensor = Tensor::zeros(vec![2, 3]).unwrap(); let result = tensor.new_operation().unwrap(); assert_eq!(result.shape().dims(), &[2, 3]); } #[test] fn test_error_handling() { let tensor = Tensor::zeros(vec![2, 3]).unwrap(); let result = tensor.invalid_operation(); assert!(result.is_err()); } } }
Documentation Requirements
- Public APIs: All public functions, structs, and traits must have documentation
- Examples: Include usage examples in documentation
- Error Cases: Document when functions return errors
- Safety: Document any unsafe code usage
#![allow(unused)] fn main() { /// Creates a new tensor filled with zeros. /// /// # Arguments /// * `shape` - The dimensions of the tensor /// /// # Returns /// A new tensor filled with zeros, or an error if the shape is invalid. /// /// # Examples /// ``` /// use tensor_frame::Tensor; /// /// let tensor = Tensor::zeros(vec![2, 3])?; /// assert_eq!(tensor.numel(), 6); /// # Ok::<(), tensor_frame::TensorError>(()) /// ``` /// /// # Errors /// Returns `TensorError::InvalidShape` if any dimension is zero. pub fn zeros(shape: Vec<usize>) -> Result<Self> { // Implementation } }
Types of Contributions
Bug Fixes
- Report the issue: Create a GitHub issue with:
- Clear reproduction steps
- Expected vs actual behavior
- Environment details (OS, Rust version, GPU info)
- Minimal code example
- Fix the bug:
- Create a focused fix addressing the specific issue
- Add regression tests to prevent recurrence
- Update documentation if the bug was in documented behavior
New Features
Before implementing new features:
- Discuss the feature: Open a GitHub issue to discuss:
- Use case and motivation
- Proposed API design
- Implementation approach
- Performance implications
- Implementation guidelines:
- Follow existing patterns and conventions
- Implement for all relevant backends
- Add comprehensive tests
- Update documentation and examples
Backend Implementation
New operations should be implemented across all backends:
#![allow(unused)] fn main() { // src/backend/mod.rs pub trait Backend { // Add new operation to trait fn new_operation(&self, input: &Storage) -> Result<Storage>; } // src/backend/cpu.rs impl Backend for CpuBackend { fn new_operation(&self, input: &Storage) -> Result<Storage> { match input { Storage::Cpu(data) => { // CPU implementation using Rayon let result: Vec<f32> = data .par_iter() .map(|&x| compute_new_operation(x)) .collect(); Ok(Storage::Cpu(result)) } _ => Err(TensorError::BackendError("Invalid storage type".to_string())), } } } // src/backend/wgpu.rs impl Backend for WgpuBackend { fn new_operation(&self, input: &Storage) -> Result<Storage> { match input { Storage::Wgpu(wgpu_storage) => { // WGPU implementation using compute shaders self.execute_compute_shader( &wgpu_storage.buffer, include_str!("../shaders/new_operation.wgsl") ) } _ => Err(TensorError::BackendError("Invalid storage type".to_string())), } } } }
Performance Improvements
- Benchmark first: Establish baseline performance
- Profile the bottleneck: Use profiling tools to identify issues
- Implement optimization: Make targeted improvements
- Measure improvement: Verify performance gains
- Add performance tests: Prevent performance regressions
#![allow(unused)] fn main() { // Add benchmark for new optimization use criterion::{criterion_group, criterion_main, Criterion}; fn bench_optimized_operation(c: &mut Criterion) { let tensor = Tensor::ones(vec![1000, 1000]).unwrap(); c.bench_function("optimized_operation", |b| { b.iter(|| { tensor.optimized_operation().unwrap() }); }); } criterion_group!(benches, bench_optimized_operation); criterion_main!(benches); }
Documentation Improvements
- API documentation: Improve function/struct documentation
- Examples: Add or improve usage examples
- Guides: Write tutorials for specific use cases
- Book: Contribute to the mdbook documentation
Backend-Specific Contributions
CPU Backend
- Optimization: Improve Rayon parallelization
- BLAS integration: Better integration with optimized BLAS libraries
- Memory layout: Optimize for cache efficiency
WGPU Backend
- Shader optimization: Improve WGSL compute shaders
- New operations: Implement missing operations (matmul, reductions)
- Platform support: Improve compatibility across graphics APIs
CUDA Backend
- Kernel optimization: Improve CUDA kernel performance
- cuBLAS integration: Better integration with cuBLAS/cuDNN
- Memory management: Optimize GPU memory usage
Pull Request Process
Before Submitting
- Ensure tests pass:
  ```bash
  cargo test --all-features
  ```
- Check formatting and lints:
  ```bash
  cargo fmt --check
  cargo clippy --all-features
  ```
- Update documentation:
  ```bash
  cargo doc --all-features
  cd docs && mdbook build
  ```
- Add changelog entry (if applicable):
  ```markdown
  ## [Unreleased]

  ### Added
  - New tensor operation `my_operation` (#123)

  ### Fixed
  - Fixed broadcasting bug in GPU backend (#124)
  ```
Pull Request Template
```markdown
## Description
Brief description of the changes and motivation.

## Type of Change
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update

## Testing
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [ ] I have tested with different backends (CPU/WGPU/CUDA)

## Checklist
- [ ] My code follows the code style of this project
- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] Any dependent changes have been merged and published
```
Review Process
- Automated checks: CI will run tests, linting, and formatting checks
- Code review: Maintainers will review for:
  - Code quality and style
  - Test coverage
  - Documentation completeness
  - Performance implications
  - API design consistency
- Feedback: Address review feedback and update the PR
- Approval: Once approved, maintainers will merge the PR
Issue Reporting
Bug Reports
Use the bug report template:

````markdown
**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Create tensor with '...'
2. Call operation '....'
3. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Code Example**
```rust
use tensor_frame::Tensor;
let tensor = Tensor::zeros(vec![2, 3])?;
let result = tensor.problematic_operation()?; // This fails
```

**Environment:**
- OS: [e.g. Ubuntu 20.04]
- Rust version: [e.g. 1.75.0]
- Tensor Frame version: [e.g. 0.1.0]
- GPU info: [if applicable]
- Backend: [CPU/WGPU/CUDA]

**Additional context**
Add any other context about the problem here.
````
Feature Requests
Use the feature request template:

````markdown
**Is your feature request related to a problem?**
A clear and concise description of what the problem is.

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Use case**
Describe how this feature would be used in practice.

**API Design** (if applicable)
```rust
// Proposed API
let result = tensor.new_operation(parameters)?;
```

**Additional context**
Add any other context about the feature request here.
````
Community Guidelines
Code of Conduct
- Be respectful and inclusive
- Focus on constructive feedback
- Help newcomers learn and contribute
- Celebrate diverse perspectives and backgrounds
Communication
- GitHub Issues: Bug reports, feature requests, design discussions
- GitHub Discussions: General questions, show and tell, ideas
- Pull Requests: Code contributions and reviews
Recognition
Contributors are recognized in:
- `CONTRIBUTORS.md` file
- Release notes for significant contributions
- GitHub contributor statistics
Getting Help
If you need help contributing:
1. Read existing code: Look at similar implementations for patterns
2. Check documentation: API docs and this book contain guidance
3. Ask questions: Open a GitHub issue or discussion
4. Start small: Begin with bug fixes or documentation improvements
Thank you for contributing to Tensor Frame!