Backends Overview

Tensor Frame's backend system provides a pluggable architecture for running tensor operations on different computational devices. This allows the same high-level tensor API to transparently utilize CPU cores, integrated GPUs, discrete GPUs, and specialized accelerators.

Available Backends

| Backend | Feature Flag | Availability | Best Use Cases |
|---------|--------------|--------------|----------------|
| CPU | `cpu` (default) | Always | Small tensors, development, fallback |
| WGPU | `wgpu` | Cross-platform GPU | Large tensors, cross-platform deployment |
| CUDA | `cuda` | NVIDIA GPUs | High-performance production workloads |
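
Each backend is compiled in through the Cargo feature flag listed above. As a minimal sketch, enabling the WGPU backend might look like this in `Cargo.toml` (the crate name is inferred from the `tensor_frame` import path and the version number is a placeholder):

```toml
[dependencies]
tensor-frame = { version = "0.1", features = ["wgpu"] }
```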

Backend Selection Strategy

By default, Tensor Frame automatically selects the best available backend using this priority order:

  1. CUDA - Highest performance on NVIDIA hardware
  2. WGPU - Cross-platform GPU acceleration
  3. CPU - Universal fallback

```rust
#![allow(unused)]
fn main() {
use tensor_frame::Tensor;

// Automatically uses the best available backend
let tensor = Tensor::zeros(vec![1000, 1000])?;
println!("Using backend: {:?}", tensor.backend_type());
}
```

Manual Backend Control

For specific requirements, you can control backend selection:

```rust
#![allow(unused)]
fn main() {
use tensor_frame::backend::{set_backend_priority, BackendType};

// Force CPU-only execution
let backend = set_backend_priority(vec![BackendType::Cpu]);

// Prefer WGPU over CUDA
let backend = set_backend_priority(vec![
    BackendType::Wgpu,
    BackendType::Cuda,
    BackendType::Cpu,
]);
}
```

Per-Tensor Backend Conversion

Convert individual tensors between backends:

```rust
#![allow(unused)]
fn main() {
use tensor_frame::{Tensor, backend::BackendType};

let cpu_tensor = Tensor::ones(vec![100, 100])?;

// Move to GPU
let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?;

// Move back to CPU
let back_to_cpu = gpu_tensor.to_backend(BackendType::Cpu)?;
}
```

Performance Characteristics

CPU Backend

  • Latency: Lowest for small operations (< 1ms)
  • Throughput: Limited by CPU cores and memory bandwidth
  • Memory: System RAM (typically abundant)
  • Parallelism: Thread-level via Rayon
  • Overhead: Minimal function call overhead

WGPU Backend

  • Latency: Higher initialization cost (~1-10ms)
  • Throughput: High for large, parallel operations
  • Memory: GPU memory (limited but fast)
  • Parallelism: Massive thread-level via compute shaders
  • Overhead: GPU command submission and synchronization

CUDA Backend

  • Latency: Moderate initialization cost (~0.1-1ms)
  • Throughput: Highest for supported operations
  • Memory: GPU memory with CUDA optimizations
  • Parallelism: Optimal GPU utilization via cuBLAS/cuDNN
  • Overhead: Minimal with mature driver stack
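
To make these trade-offs concrete, here is a minimal timing sketch using only the APIs shown in this chapter. It assumes the `wgpu` feature is enabled; absolute numbers depend entirely on your hardware, and the GPU timing includes the command-submission overhead noted above:

```rust
#![allow(unused)]
fn main() {
use std::time::Instant;
use tensor_frame::{Tensor, backend::BackendType};

// Build one large tensor, then place a copy on each backend.
let cpu = Tensor::ones(vec![2048, 2048])?;
let gpu = cpu.to_backend(BackendType::Wgpu)?;

// Time the same element-wise operation on both backends.
// clone() is O(1), so it does not distort the measurement.
let t = Instant::now();
let _ = cpu.clone() + cpu.clone();
println!("CPU add:  {:?}", t.elapsed());

let t = Instant::now();
let _ = gpu.clone() + gpu.clone();
println!("WGPU add: {:?}", t.elapsed());
}
```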

When to Use Each Backend

CPU Backend

```rust
#![allow(unused)]
fn main() {
use tensor_frame::Tensor;

// Good for:
let small_tensor = Tensor::ones(vec![10, 10])?;   // small tensors
let dev_tensor = Tensor::zeros(vec![100])?;       // development/testing
let scalar_ops = small_tensor.sum(None)?;         // scalar results

// Avoid for:
// - Large matrix multiplications (> 1000x1000)
// - Batch operations on many tensors
// - Compute-intensive element-wise operations
}
```

WGPU Backend

```rust
#![allow(unused)]
fn main() {
use tensor_frame::Tensor;

// Good for (assumes `tensors: Vec<Tensor>` and tensors `a`, `b`, `c`
// are already in scope):
let large_tensor = Tensor::zeros(vec![2048, 2048])?;  // large tensors
let batch_ops = tensors.iter().map(|t| t * 2.0);      // batch operations
let element_wise = (a * b) + c;                       // complex element-wise

// Consider for:
// - Cross-platform deployment
// - When CUDA is not available
// - Mixed CPU/GPU workloads
}
```

CUDA Backend

```rust
#![allow(unused)]
fn main() {
use tensor_frame::Tensor;

// Excellent for (assumes tensors `a`, `b`, an `input`, and a `model`
// with a `forward` method are already in scope):
let huge_tensor = Tensor::zeros(vec![4096, 4096])?;  // very large tensors
let matrix_mul = a.matmul(&b)?;                      // matrix operations
let ml_workload = model.forward(input)?;             // ML training/inference

// Best when:
// - NVIDIA GPU available
// - Performance is critical
// - Using alongside other CUDA libraries
}
```

Cross-Backend Operations

Operations between tensors on different backends automatically handle conversion:

```rust
#![allow(unused)]
fn main() {
use tensor_frame::{Tensor, backend::BackendType};

let cpu_a = Tensor::ones(vec![1000])?;
let gpu_b = Tensor::zeros(vec![1000])?.to_backend(BackendType::Wgpu)?;

// Automatically converts to a common backend
let result = cpu_a + gpu_b;  // Runs on the CPU backend
}
```

Conversion Rules:

  1. If both backends match, the operation runs on that backend
  2. If they differ, both operands are converted to the more broadly compatible backend before the operation runs (see the sketch below)
  3. Compatibility order: CPU > WGPU > CUDA, so CPU wins any mixed-backend operation because it is always available
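
As an illustrative sketch of rules 2 and 3, mixing the two GPU backends resolves to WGPU. This assumes both the `wgpu` and `cuda` features are enabled and suitable hardware is present:

```rust
#![allow(unused)]
fn main() {
use tensor_frame::{Tensor, backend::BackendType};

let wgpu_t = Tensor::ones(vec![512])?.to_backend(BackendType::Wgpu)?;
let cuda_t = Tensor::ones(vec![512])?.to_backend(BackendType::Cuda)?;

// WGPU outranks CUDA in compatibility order, so the CUDA operand
// is converted and the addition runs on the WGPU backend.
let result = wgpu_t + cuda_t;
println!("Ran on: {:?}", result.backend_type());
}
```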

Memory Management

Reference Counting

All backends use reference counting for efficient memory management:

```rust
#![allow(unused)]
fn main() {
use tensor_frame::Tensor;

let tensor1 = Tensor::ones(vec![1000, 1000])?;
let tensor2 = tensor1.clone();  // O(1) - just increments the reference count

// Memory is freed automatically when the last reference is dropped
}
```

Cross-Backend Memory

Converting between backends allocates new memory:

```rust
#![allow(unused)]
fn main() {
use tensor_frame::{Tensor, backend::BackendType};

let cpu_tensor = Tensor::ones(vec![1000])?;                  // 4 KB of CPU memory
let gpu_tensor = cpu_tensor.to_backend(BackendType::Wgpu)?;  // +4 KB of GPU memory

// Both tensors exist independently until dropped
}
```

Memory Usage Guidelines

  • Development: Use CPU backend to avoid GPU memory pressure
  • Production: Convert to GPU early, minimize cross-backend copies
  • Mixed workloads: Keep frequently-accessed tensors on CPU
  • Large datasets: Stream data through GPU backends in chunks (sketched below)
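
For the streaming case, a minimal sketch of the chunked pattern: only one chunk lives in GPU memory at a time, and only the small reduced result comes back. The `Tensor::from_vec` constructor and the `dataset` vector are illustrative assumptions, not APIs confirmed by this chapter:

```rust
#![allow(unused)]
fn main() {
use tensor_frame::{Tensor, backend::BackendType};

let dataset: Vec<f32> = vec![0.0; 16 * 1024 * 1024];  // assumed input data

for chunk in dataset.chunks(1024 * 1024) {
    // Hypothetical constructor: build a CPU tensor from a slice of data.
    let cpu_chunk = Tensor::from_vec(chunk.to_vec(), vec![chunk.len()])?;

    // Move the chunk to the GPU, reduce it there, and bring back only
    // the small result so GPU memory usage stays bounded.
    let gpu_chunk = cpu_chunk.to_backend(BackendType::Wgpu)?;
    let partial = gpu_chunk.sum(None)?.to_backend(BackendType::Cpu)?;

    // ...accumulate `partial` into a running total here...
}
}
```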

Error Handling

Backend operations can fail for various reasons:

```rust
#![allow(unused)]
fn main() {
use tensor_frame::{Tensor, TensorError};

match Tensor::zeros(vec![100000, 100000]) {
    Ok(tensor) => println!("Created tensor on {:?}", tensor.backend_type()),
    Err(TensorError::BackendError(msg)) => {
        eprintln!("Backend error: {}", msg);
        // Fall back to a smaller size or a different backend
    }
    Err(e) => eprintln!("Other error: {}", e),
}
}
```

Common Error Scenarios:

  • GPU Out of Memory: Try smaller tensors or CPU backend
  • Backend Unavailable: Fallback to CPU backend
  • Feature Not Implemented: Some operations only available on certain backends
  • Cross-Backend Type Mismatch: Ensure compatible data types
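
A common pattern for the first two scenarios, sketched with the APIs from this chapter: prefer a GPU backend, and keep the CPU tensor when the conversion fails:

```rust
#![allow(unused)]
fn main() {
use tensor_frame::{Tensor, TensorError, backend::BackendType};

let tensor = Tensor::zeros(vec![4096, 4096])?;

// Prefer the GPU, but degrade gracefully if it is unavailable
// or out of memory.
let tensor = match tensor.to_backend(BackendType::Wgpu) {
    Ok(gpu_tensor) => gpu_tensor,
    Err(TensorError::BackendError(msg)) => {
        eprintln!("WGPU unavailable ({}), staying on CPU", msg);
        tensor
    }
    Err(e) => {
        eprintln!("Unexpected error: {}", e);
        tensor
    }
};
}
```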

Backend Implementation Status

| Operation | CPU | WGPU | CUDA |
|-------------------------------|-----|------|------|
| Basic arithmetic (+, -, *, /) |     |      |      |
| Reductions (sum, mean)        |     |      |      |
| Reshape, transpose            |     |      |      |
| Broadcasting                  |     |      |      |

✅ = Fully implemented
❌ = Not yet implemented
⚠️ = Partially implemented