ghostjat/pml
Latest stable version: 0.0.4
Composer install command:
composer require ghostjat/pml
Package description
High-performance Tensor library for PHP utilizing FFI, OpenBLAS, and zero-copy memory operations.
README
Author: Shubham Chaudhary
Zero-copy. Cache-friendly. HPC-inspired. Built for serious workloads — in PHP.
✨ Overview
PML is a next-generation machine learning framework engineered in PHP with a strong focus on high-performance computing (HPC) principles. Unlike traditional PHP ML libraries, PML embraces:
- ⚡ FFI-powered native acceleration (C backend)
- 🧠 Cache-friendly tensor layouts (B × D × T × N)
- 🔁 Zero-copy memory pipelines
- 🧮 Vectorized + SIMD-optimized math kernels
- 🧵 Parallel execution via OpenMP
This results in a system that delivers near-native performance while retaining PHP’s flexibility.
🧩 Core Architecture
    ┌────────────────────────────┐
    │        PHP Userland        │
    │  (Models, Pipelines, API)  │
    └────────────┬───────────────┘
                 │ FFI Calls
    ┌────────────▼───────────────┐
    │         FFI Bridge         │
    │    (Zero-copy bindings)    │
    └────────────┬───────────────┘
                 │
    ┌────────────▼───────────────┐
    │      C Tensor Engine       │
    │ libtensor.so (SIMD + OMP)  │
    └────────────┬───────────────┘
                 │
    ┌────────────▼───────────────┐
    │   Hardware Optimizations   │
    │ • SIMD (AVX/NEON)          │
    │ • OpenMP Threads           │
    │ • Cache-aware Layouts      │
    └────────────────────────────┘
PHP (Userland)
↓
FFI Layer
↓
C Tensor Engine (libtensor.so)
↓
SIMD / OpenMP / Cache-Optimized Kernels
🔬 Key Design Principles
- Zero-copy data flow → No redundant memory allocations
- In-place operations → Reduced memory pressure
- Cache locality awareness → Faster sequential access
- Batch-first execution → Optimized for ML workloads
⚙️ Features
🧮 Tensor Engine
- Dense tensor operations (add, mul, div, exp, log, sqrt)
- Broadcasting & reshaping
- Matrix multiplication (optimized for large sizes)
- Linear algebra (SVD, inverse, pseudo-inverse)
- SIMD-accelerated activation functions
📊 Dataset & ETL
- CSV ingestion (benchmarked at 100k rows)
- Batch generation, shuffling, splitting
- StandardScaler / MinMaxScaler
- Zero-copy slicing & batching
🤖 Machine Learning Models
- Decision Trees
- Random Forest
- Gradient Boosting
- Logistic Regression
- Linear Regression
- Gaussian Naive Bayes
- K-Means, PCA
🧠 Neural Networks
- Fully connected layers
- Backpropagation
- Optimizers (Adam, fused ops)
- Loss functions (BCE, CCE)
📝 NLP Pipeline
- Bag-of-Words / TF-IDF
- Vectorization pipelines
- Mini-batch training
🖼️ Image Processing
- Parallel resizing
- Zero-copy cropping
- RGB → Grayscale transforms
📈 Benchmark Highlights
📊 Visual Overview (Relative Performance)
Tensor Ops (1M elements)
Add ████████████████ 1.78ms
Mul ████████████ 1.28ms
ReLU ██████ 0.69ms
Sigmoid █████████ 1.03ms
MatMul
256x256 ███████████████████ 2.7ms
512x512 █████████████████████████ 5.3ms
1k x 1k █████████████████████████████████ 11ms
Training
LogReg ███ 15ms
GBDT ████████ 60ms
RF (20) █████████████ 494ms
➡️ Bars represent relative compute cost (lower is better)
Benchmark suite summary: Subjects: 236, Assertions: 10, Failures: ⚠️ 3, Errors: ✅ 0
⚡ FFI Overhead (Ultra-low latency)
| Operation | Time |
|---|---|
| Scalar sum | 2.685 μs |
| Sigmoid (in-place) | 2.580 μs |
| Shape query | 1.391 μs |
➡️ Insight: at roughly 1–3 μs per call, FFI overhead is negligible next to millisecond-scale tensor kernels, though it can dominate very small operations.
🧮 Tensor Performance
| Operation | Size | Time |
|---|---|---|
| Add | 1M | 1.782 ms |
| Multiply | 1M | 1.289 ms |
| ReLU | 1M | 699 μs |
| MatMul | 512×512 | 5.366 ms |
| MatMul | 1k×1k | ~11 ms |
➡️ Efficient scaling across vectorized workloads.
📊 Dataset ETL
| Task | Size | Time |
|---|---|---|
| CSV Load | 100k rows | 80.8 ms |
| Array → Dataset | 100k×10 | 159 ms |
| Standard Scaling | 100k | 3.7 ms |
➡️ High-throughput preprocessing pipeline.
🤖 Model Training
| Model | Dataset | Time |
|---|---|---|
| Decision Tree | 2k | 203 ms |
| Random Forest (20 trees) | 2k | 494 ms |
| Logistic Regression | 2k | 15 ms |
| Gradient Boosting | 2k | 60 ms |
➡️ Competitive training performance for tabular ML.
🧠 Neural Network
| Task | Time |
|---|---|
| Full Training Loop | 1.241 s |
| Inference | 113 μs |
➡️ Suitable for lightweight deep learning workloads.
🧵 Parallel + SIMD
- OpenMP acceleration for large tensors
- SIMD kernels for activation functions
Example:
| Operation | Size | Time |
|---|---|---|
| Sigmoid | 10M | 11.49 ms |
| Add | 10M | 9.70 ms |
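The OpenMP acceleration for large tensors mentioned above could look roughly like the following. This is a minimal sketch, not PML's actual kernel: the function name `add_inplace` and the 100000-element threshold are assumptions made for illustration. The pragma is guarded so the code still builds (single-threaded) when compiled without `-fopenmp`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element-wise kernel: a[i] += b[i].
 * Threads are used only when OpenMP is enabled at build time AND the
 * tensor is large enough to amortize thread start-up cost.
 * The 100000 cutoff is an illustrative guess, not a tuned value. */
void add_inplace(float *a, const float *b, size_t n) {
    assert(n == 0 || (a != NULL && b != NULL));
#ifdef _OPENMP
#pragma omp parallel for if(n > 100000)
#endif
    for (size_t i = 0; i < n; i++) {
        a[i] += b[i];
    }
}
```

The `if` clause keeps small tensors on a single thread, where spawning workers would cost more than the arithmetic itself.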
🤯 Why PHP for Machine Learning?
"Because constraints create innovation."
🔥 The Controversy
Most engineers assume:
- PHP = slow ❌
- Python = ML ✅
PML challenges that assumption.
💡 Reality Check
- PHP + FFI → direct native execution
- C backend → same performance class as NumPy/PyTorch CPU
- Zero-copy → less memory overhead than Python in many cases
⚡ Where PHP Wins
- Tight integration with web stacks
- Zero deployment friction (already everywhere)
- Predictable memory model vs Python GC quirks
🚫 Where It Doesn’t
- GPU ecosystem still immature
- Smaller ML community
➡️ PML is not replacing Python — it’s expanding the design space.
⚔️ Comparison (Measured PML vs. Estimated NumPy / PyTorch)
| Operation (1M) | PML | NumPy (est) | PyTorch (CPU est) |
|---|---|---|---|
| Add | 1.78 ms | ~2.5 ms | ~2.0 ms |
| Multiply | 1.28 ms | ~2.2 ms | ~1.9 ms |
| ReLU | 0.69 ms | ~1.8 ms | ~1.5 ms |
| Sigmoid | 1.03 ms | ~3.0 ms | ~2.2 ms |
| MatMul 512² | 5.36 ms | ~6–8 ms | ~5–7 ms |
⚠️ Benchmarks vary by CPU (AVX2/AVX512, cache, threads)
➡️ PML achieves competitive CPU performance, especially in in-place ops.
| Feature | PML | PyTorch | NumPy | RubixML |
|---|---|---|---|---|
| Language | PHP + C | Python + C++ | Python + C | PHP |
| FFI | ✅ | ❌ | ❌ | ❌ |
| Zero-copy | ✅ | ⚠️ Partial | ⚠️ Partial (views) | ❌ |
| SIMD | ✅ | ✅ | ✅ | ❌ |
| OpenMP | ✅ | ✅ | ❌ | ❌ |
| ML Models | ✅ | ✅ | ❌ | ✅ |
| Neural Nets | ✅ | ✅ | ❌ | ⚠️ Limited |
| HPC Design | ✅ | ✅ | ❌ | ❌ |
➡️ PML uniquely combines PHP ergonomics + HPC internals.
🧠 Memory Efficiency
- Typical tensor operations: ~3.8 MB peak
- Zero-copy dataset slicing
- In-place ops significantly reduce allocations
➡️ Designed for low-memory, high-throughput environments
🧪 SIMD Detection (AVX2 / AVX512)
PML can leverage advanced CPU vector instructions when available.
    # Linux
    lscpu | grep -E "avx2|avx512"
    # Or
    cat /proc/cpuinfo | grep -i avx
🧠 Runtime Detection (C)
    #include <immintrin.h>

    int has_avx2(void)   { return __builtin_cpu_supports("avx2"); }
    int has_avx512(void) { return __builtin_cpu_supports("avx512f"); }
➡️ Kernels automatically switch to best available SIMD path.
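One common way to implement this kind of automatic switching is a function pointer chosen once at startup. The sketch below assumes GCC or Clang; the names `relu_fn`, `relu_scalar`, `relu_avx2` and `select_relu` are hypothetical and are not taken from PML's source. On non-x86 targets it simply falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_X86 1
#endif

typedef void (*relu_fn)(float *x, size_t n);

/* Portable fallback: clamp negatives to zero one element at a time. */
void relu_scalar(float *x, size_t n) {
    assert(n == 0 || x != NULL);
    for (size_t i = 0; i < n; i++) {
        if (x[i] < 0.0f) x[i] = 0.0f;
    }
}

#ifdef SKETCH_X86
/* AVX2-targeted path: 8 floats per iteration, scalar loop for the tail.
 * The target attribute lets this one function use 256-bit intrinsics
 * without compiling the whole file with -mavx2. */
__attribute__((target("avx2")))
void relu_avx2(float *x, size_t n) {
    const __m256 zero = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        _mm256_storeu_ps(x + i, _mm256_max_ps(_mm256_loadu_ps(x + i), zero));
    }
    for (; i < n; i++) {
        if (x[i] < 0.0f) x[i] = 0.0f;
    }
}
#endif

/* Return the best kernel the running CPU supports. */
relu_fn select_relu(void) {
#ifdef SKETCH_X86
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return relu_avx2;
#endif
    return relu_scalar;
}
```

A caller would typically store the result of `select_relu()` once and reuse it, so the CPU check is not repeated on every call.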
🔥 Performance Profiling
Flamegraph Example
    perf record -F 99 -g php benchmark.php
    perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
Snapshot Insight
[ tensor_matmul ] ███████████████ 40%
[ tensor_add ] ███████ 15%
[ sigmoid ] ████ 8%
[ php overhead ] ██ 4%
➡️ Most time spent in optimized C kernels (expected).
🔧 Installation
    git clone https://github.com/your-repo/pml.git
    cd pml
    # Build native backend
    make
    # Install PHP dependencies
    composer install
🧪 GitHub Actions (CI)
    name: CI
    on: [push, pull_request]
    jobs:
      build:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Setup PHP
            uses: shivammathur/setup-php@v2
            with:
              php-version: 8.3
              extensions: ffi
          - name: Install dependencies
            run: composer install --no-interaction
          - name: Build C backend
            run: make
          - name: Run Tests
            run: vendor/bin/phpunit
          - name: Run Benchmarks
            run: vendor/bin/phpbench run
🚀 Quick Example
    use Pml\Dataset;
    use Pml\Models\LogisticRegression;

    $dataset = Dataset::fromCsv('data.csv')
        ->standardize()
        ->split(0.8);

    $model = new LogisticRegression();
    $model->train($dataset->train());

    $predictions = $model->predict($dataset->test());
🔬 Deep Dive: Zero-Copy + Cache Layout
🔧 Internal C Layer Walkthrough
Tensor Struct (Conceptual)
    typedef struct {
        float* data;   // contiguous memory
        int*   shape;  // dimensions
        int    ndim;   // number of dimensions
        int    size;   // total elements
    } Tensor;
Example: In-place Sigmoid
    #include <math.h>  // expf

    // Overwrites each element with sigmoid(x) = 1 / (1 + e^-x).
    void tensor_sigmoid_inplace(Tensor* t) {
        for (int i = 0; i < t->size; i++) {
            float x = t->data[i];
            t->data[i] = 1.0f / (1.0f + expf(-x));
        }
    }
➡️ No allocation. Direct memory mutation.
Example: FFI Binding (PHP)
$ffi->tensor_sigmoid_inplace($tensor);
➡️ PHP calls the C function directly through FFI, so the only added cost is the microsecond-scale call overhead measured in the FFI benchmarks.
Memory Layout Insight
Contiguous Block:
[x1 x2 x3 x4 x5 ...]
➡️ Enables:
- SIMD vector loads
- Cache line efficiency
- Prefetch-friendly execution
🧠 Problem
Traditional PHP ML:
- Arrays = scattered memory
- Copy-heavy pipelines
- Cache misses → slow execution
⚡ Solution (PML)
1. Zero-Copy Design
- Data passed by reference across layers
- No duplication between PHP ↔ C
- Batch slicing = pointer offsets only
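To make "pointer offsets only" concrete, here is a minimal sketch of a non-owning batch view. The `MatrixView` type and `batch_slice` function are hypothetical names introduced for illustration, and the row-major layout is an assumption; PML's real slicing code may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical non-owning view over a row-major [rows x cols] matrix. */
typedef struct {
    float *data;   /* borrowed pointer into the parent buffer */
    size_t rows;
    size_t cols;
} MatrixView;

/* Return a view of `count` rows starting at row `start`.
 * No elements are copied: the result aliases the parent's memory,
 * so creating a batch costs one pointer addition. */
MatrixView batch_slice(MatrixView parent, size_t start, size_t count) {
    assert(start <= parent.rows && count <= parent.rows - start);
    MatrixView v = { parent.data + start * parent.cols, count, parent.cols };
    return v;
}
```

Because the view shares storage with its parent, writes through the batch are visible in the full dataset, and the parent buffer must outlive every view derived from it.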
2. Cache-Friendly Layout
[B × D × T × N]
B = Batch
D = Features / Embedding
T = Time / Sequence
N = Head / Channel
➡️ Ensures sequential memory access, maximizing CPU cache hits.
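The indexing behind such a layout can be sketched as follows. This assumes row-major storage with N as the innermost (contiguous) axis, which the text above does not state explicitly; `Shape4`, `offset4` and `sum_sample` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shape descriptor for a [B x D x T x N] tensor. */
typedef struct { size_t B, D, T, N; } Shape4;

/* Flat offset of element (b, d, t, n) in row-major order.
 * Neighbouring n values are adjacent in memory. */
size_t offset4(Shape4 s, size_t b, size_t d, size_t t, size_t n) {
    assert(b < s.B && d < s.D && t < s.T && n < s.N);
    return ((b * s.D + d) * s.T + t) * s.N + n;
}

/* Sum every element of sample b. The D*T*N elements of one sample form
 * a single contiguous run, so the loop walks memory sequentially. */
float sum_sample(const float *data, Shape4 s, size_t b) {
    const float *p = data + offset4(s, b, 0, 0, 0);
    size_t count = s.D * s.T * s.N;
    float acc = 0.0f;
    for (size_t i = 0; i < count; i++) {
        acc += p[i];
    }
    return acc;
}
```

With the batch axis outermost, each sample occupies one unbroken block, which is what lets a batch-first loop stream through cache lines without strided jumps.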
3. In-place Operations
x = sigmoid(x) // no new allocation
➡️ Reduces memory churn + improves throughput.
4. Fused Kernels
loss + gradient → single pass
➡️ Avoids a second pass over the data, which reduces memory bandwidth usage.
📦 Advanced Capabilities
- 🔁 Zero-copy batch pipelines
- ⚡ Fused kernels (loss + gradient)
- 🧵 Parallel tensor ops (OpenMP)
- 🧠 Cache-optimized layouts for sequence models
- 📉 Numerical stability (softmax, log, etc.)
⚠️ Known Issues
- 3 failing assertions in benchmark suite
- High variance in some SIMD benchmarks (expected due to CPU scheduling)
🛣️ Roadmap
🔜 Short Term
- Fix remaining 3 failing assertions
- Improve SIMD variance stability
- Expand dataset streaming (GB-scale)
🚀 Mid Term
- JIT kernel fusion engine
- Memory pool allocator
- Advanced optimizers (AdamW, RMSProp)
🌌 Long Term
- GPU backend (CUDA / OpenCL)
- Transformer / LLM primitives
- Distributed training (multi-node)
- ONNX import/export
🤝 Contributing
Pull requests are welcome. For major changes, please open an issue first.
📄 Whitepaper
A research-style deep dive is available:
whitepaper.md
Contents
- HPC design philosophy in PHP
- Zero-copy architecture analysis
- Benchmark methodology
- SIMD + OpenMP strategies
- Comparison with Python ML stack
📄 License
MIT License
💡 Final Thought
"PHP was never meant for HPC… until now."
PML pushes PHP beyond its limits — into the domain of high-performance machine learning systems.
🔥 If you like this project, give it a star and push PHP further!
Statistics
- Total downloads: 6
- Monthly downloads: 0
- Daily downloads: 0
- Favorites: 0
- Views: 6
- Dependents: 1
- Recommendations: 0
Other information
- License: MIT
- Last updated: 2026-03-23