README

Author: Shubham Chaudhary

Zero-copy. Cache-friendly. HPC-inspired. Built for serious workloads — in PHP.

✨ Overview

PML is a next-generation machine learning framework engineered in PHP with a strong focus on high-performance computing (HPC) principles. Unlike traditional PHP ML libraries, PML embraces:

⚡ FFI-powered native acceleration (C backend)
🧠 Cache-friendly tensor layouts (B × D × T × N)
🔁 Zero-copy memory pipelines
🧮 Vectorized + SIMD-optimized math kernels
🧵 Parallel execution via OpenMP

This results in a system that delivers near-native performance while retaining PHP’s flexibility.

🧩 Core Architecture

┌────────────────────────────┐
│        PHP Userland        │
│  (Models, Pipelines, API)  │
└────────────┬───────────────┘
             │ FFI Calls
┌────────────▼───────────────┐
│         FFI Bridge         │
│    (Zero-copy bindings)    │
└────────────┬───────────────┘
             │
┌────────────▼───────────────┐
│      C Tensor Engine       │
│ libtensor.so (SIMD + OMP)  │
└────────────┬───────────────┘
             │
┌────────────▼───────────────┐
│   Hardware Optimizations   │
│  • SIMD (AVX/NEON)         │
│  • OpenMP Threads          │
│  • Cache-aware Layouts     │
└────────────────────────────┘

PHP (Userland)
   ↓
FFI Layer
   ↓
C Tensor Engine (libtensor.so)
   ↓
SIMD / OpenMP / Cache-Optimized Kernels

🔬 Key Design Principles

Zero-copy data flow → No redundant memory allocations
In-place operations → Reduced memory pressure
Cache locality awareness → Faster sequential access
Batch-first execution → Optimized for ML workloads

⚙️ Features

🧮 Tensor Engine

Dense tensor operations (add, mul, div, exp, log, sqrt)
Broadcasting & reshaping
Matrix multiplication (optimized for large sizes)
Linear algebra (SVD, inverse, pseudo-inverse)
SIMD-accelerated activation functions
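
As a rough illustration of what broadcasting means at the kernel level, the sketch below adds a row vector to every row of a matrix without ever expanding the vector into a full matrix. The function name `add_row_broadcast` and its signature are hypothetical and chosen for this example; they are not taken from PML's API.

```c
#include <assert.h>
#include <stddef.h>

/* Add a length-cols row vector to every row of a row-major rows x cols
 * matrix, in place. The vector is "broadcast" by reusing index c for each
 * row; it is never materialized as a full matrix. (Hypothetical helper.) */
void add_row_broadcast(float *m, size_t rows, size_t cols, const float *v) {
    assert(m && v);
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            m[r * cols + c] += v[c];
}
```

The inner loop touches `m` with unit stride, which is the access pattern SIMD units and hardware prefetchers handle best.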

📊 Dataset & ETL

CSV ingestion (benchmarked at 100k rows)
Batch generation, shuffling, splitting
StandardScaler / MinMaxScaler
Zero-copy slicing & batching

🤖 Machine Learning Models

Decision Trees
Random Forest
Gradient Boosting
Logistic Regression
Linear Regression
Gaussian Naive Bayes
K-Means, PCA

🧠 Neural Networks

Fully connected layers
Backpropagation
Optimizers (Adam, fused ops)
Loss functions (BCE, CCE)

📝 NLP Pipeline

Bag-of-Words / TF-IDF
Vectorization pipelines
Mini-batch training

🖼️ Image Processing

Parallel resizing
Zero-copy cropping
RGB → Grayscale transforms

📈 Benchmark Highlights

📊 Visual Overview (Relative Performance)

Tensor Ops (1M elements)
Add        ████████████████ 1.78ms
Mul        ████████████     1.28ms
ReLU       ██████           0.69ms
Sigmoid    █████████        1.03ms

MatMul
256x256    ███████████████████ 2.7ms
512x512    █████████████████████████ 5.3ms
1k x 1k    █████████████████████████████████ 11ms

Training
LogReg     ███ 15ms
GBDT       ████████ 60ms
RF (20)    █████████████ 494ms

➡️ Bars show relative compute cost within each group and are not drawn to scale (lower is better)

Benchmark suite summary: 236 subjects, 10 assertions, ⚠️ 3 failures, ✅ 0 errors

⚡ FFI Overhead (Ultra-low latency)

Operation	Time
Scalar sum	2.685 μs
Sigmoid (in-place)	2.580 μs
Shape query	1.391 μs

➡️ Insight: at roughly 1–3 μs per call, FFI overhead is negligible once each call does a meaningful amount of work (for example, a 1M-element tensor op).

🧮 Tensor Performance

Operation	Size	Time
Add	1M	1.782 ms
Multiply	1M	1.289 ms
ReLU	1M	699 μs
MatMul	512×512	5.366 ms
MatMul	1k×1k	~11 ms

➡️ Efficient scaling across vectorized workloads.
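
One common way to make a plain matrix multiply cache-friendly is to reorder the loops so the innermost one walks memory with unit stride. The sketch below shows that idea for square row-major matrices. It is an illustration only: `matmul_ikj` is a hypothetical name, and PML's actual kernel may use blocking, SIMD intrinsics, or a different strategy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* C = A * B for row-major n x n matrices, using i-k-j loop order so the
 * innermost loop reads a row of B and updates a row of C sequentially.
 * (Hypothetical helper for illustration.) */
void matmul_ikj(const float *A, const float *B, float *C, size_t n) {
    assert(A && B && C);
    memset(C, 0, n * n * sizeof(float));   /* all-zero bits == 0.0f */
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            const float a = A[i * n + k];  /* reused across the whole j loop */
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}
```

Compared with the textbook i-j-k order, this avoids striding down a column of B in the inner loop, which would touch a new cache line on almost every step for large n.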

📊 Dataset ETL

Task	Size	Time
CSV Load	100k rows	80.8 ms
Array → Dataset	100k×10	159 ms
Standard Scaling	100k	3.7 ms

➡️ High-throughput preprocessing pipeline.

🤖 Model Training

Model	Dataset	Time
Decision Tree	2k	203 ms
Random Forest (20 trees)	2k	494 ms
Logistic Regression	2k	15 ms
Gradient Boosting	2k	60 ms

➡️ Competitive training performance for tabular ML.

🧠 Neural Network

Task	Time
Full Training Loop	1.241 s
Inference	113 μs

➡️ Suitable for lightweight deep learning workloads.

🧵 Parallel + SIMD

OpenMP acceleration for large tensors
SIMD kernels for activation functions

Example:

Operation	Size	Time
Sigmoid	10M	11.49 ms
Add	10M	9.70 ms
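
A minimal sketch of how an element-wise kernel can combine OpenMP threading with SIMD vectorization is shown below. The function name `tensor_add_parallel` is hypothetical, and the pragma choice (`parallel for simd` with a static schedule) is one reasonable option, not necessarily what PML uses.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise out[i] = a[i] + b[i]. With OpenMP enabled (e.g. -fopenmp),
 * iterations are split across threads and each chunk is vectorized; without
 * it the pragma is ignored and the loop runs serially with the same result.
 * (Hypothetical helper for illustration.) */
void tensor_add_parallel(const float *a, const float *b, float *out, size_t n) {
    assert(a && b && out);
    #pragma omp parallel for simd schedule(static)
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

For small tensors the cost of starting threads can exceed the work itself, so kernels like this are typically only worth parallelizing above some size threshold.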

🤯 Why PHP for Machine Learning?

"Because constraints create innovation."

🔥 The Controversy

Most engineers assume:

PHP = slow ❌
Python = ML ✅

PML challenges that assumption.

💡 Reality Check

PHP + FFI → direct native execution
C backend → same performance class as NumPy/PyTorch CPU
Zero-copy → less memory overhead than Python in many cases

⚡ Where PHP Wins

Tight integration with web stacks
Low deployment friction where PHP already runs (needs the FFI extension and the compiled native backend)
Predictable memory model vs Python GC quirks

🚫 Where It Doesn’t

GPU ecosystem still immature
Smaller ML community

➡️ PML is not replacing Python — it’s expanding the design space.

⚔️ Comparison (PML measured; NumPy and PyTorch estimated)

Operation (1M)	PML	NumPy (est)	PyTorch (CPU est)
Add	1.78 ms	~2.5 ms	~2.0 ms
Multiply	1.28 ms	~2.2 ms	~1.9 ms
ReLU	0.69 ms	~1.8 ms	~1.5 ms
Sigmoid	1.03 ms	~3.0 ms	~2.2 ms
MatMul 512²	5.36 ms	~6–8 ms	~5–7 ms

⚠️ Benchmarks vary by CPU (AVX2/AVX512, cache, threads)

➡️ PML achieves competitive CPU performance, especially in in-place ops.

Feature	PML	PyTorch	NumPy	RubixML
Language	PHP + C	Python + C++	Python + C	PHP
FFI	✅	❌	❌	❌
Zero-copy	✅	⚠️ Partial	⚠️ Partial (views)	❌
SIMD	✅	✅	✅	❌
OpenMP	✅	✅	❌	❌
ML Models	✅	✅	❌	✅
Neural Nets	✅	✅	❌	⚠️ Limited
HPC Design	✅	✅	❌	❌

➡️ PML uniquely combines PHP ergonomics + HPC internals.

🧠 Memory Efficiency

Typical tensor operations: ~3.8 MB peak (about the size of one 1M-element float32 buffer: 4,000,000 bytes ≈ 3.8 MiB)
Zero-copy dataset slicing
In-place ops significantly reduce allocations

➡️ Designed for low-memory, high-throughput environments

🧪 SIMD Detection (AVX2 / AVX512)

PML can leverage advanced CPU vector instructions when available.

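# Linux: list the AVX2 / AVX-512 feature flags reported by the CPU
lscpu | grep -o -E 'avx2|avx512[a-z0-9_]*' | sort -u

# Or read the flags directly from /proc/cpuinfo
grep -o -i -E 'avx[a-z0-9_]*' /proc/cpuinfo | sort -u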

🧠 Runtime Detection (C)

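#include <immintrin.h>  /* SIMD intrinsics for the kernels; the checks below need no header */

/* __builtin_cpu_supports is a GCC/Clang builtin available on x86 targets. */
int has_avx2(void) {
    return __builtin_cpu_supports("avx2") != 0;
}

/* "avx512f" is the AVX-512 Foundation subset. */
int has_avx512(void) {
    return __builtin_cpu_supports("avx512f") != 0;
}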

➡️ Kernels automatically switch to best available SIMD path.
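
One way such automatic switching can be wired up is a function pointer chosen once at startup. The sketch below is an assumption about the general pattern, not PML's actual dispatch code: the `relu_*` names are hypothetical, and the AVX2/AVX-512 variants are stand-ins that simply call the scalar version so the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*relu_fn)(float *x, size_t n);

/* Portable fallback. A real build would add AVX2/AVX-512 variants compiled
 * with the matching target flags; here they are stubs for illustration. */
void relu_scalar(float *x, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (x[i] < 0.0f) x[i] = 0.0f;
}
void relu_avx2_stub(float *x, size_t n)   { relu_scalar(x, n); }
void relu_avx512_stub(float *x, size_t n) { relu_scalar(x, n); }

/* Pick the widest supported path once; callers cache the returned pointer. */
relu_fn select_relu(void) {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx512f")) return relu_avx512_stub;
    if (__builtin_cpu_supports("avx2"))    return relu_avx2_stub;
#endif
    return relu_scalar;   /* non-x86 targets or older CPUs */
}
```

Resolving the pointer once keeps the CPU feature check out of the hot loop, so the per-call cost is a single indirect call.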

🔥 Performance Profiling

Flamegraph Example

# Requires the FlameGraph scripts (stackcollapse-perf.pl, flamegraph.pl) on your PATH
perf record -F 99 -g php benchmark.php
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

Snapshot Insight

[ tensor_matmul ] ███████████████ 40%
[ tensor_add ]    ███████         15%
[ sigmoid ]       ████            8%
[ php overhead ]  ██              4%

➡️ Most time spent in optimized C kernels (expected).

🔧 Installation

git clone https://github.com/your-repo/pml.git
cd pml

# Build native backend
make

# Install PHP dependencies
composer install

🧪 GitHub Actions (CI)

name: CI

on: [push, pull_request]

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Setup PHP
        uses: shivammathur/setup-php@v2
        with:
          php-version: 8.3
          extensions: ffi

      - name: Install dependencies
        run: composer install --no-interaction

      - name: Build C backend
        run: make

      - name: Run Tests
        run: vendor/bin/phpunit

      - name: Run Benchmarks
        run: vendor/bin/phpbench run

🚀 Quick Example

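<?php
require __DIR__ . '/vendor/autoload.php'; // Composer autoloader

use Pml\Dataset;
use Pml\Models\LogisticRegression;

// Load the CSV, standardize the features, then split 80/20 into train/test
$dataset = Dataset::fromCsv('data.csv')
    ->standardize()
    ->split(0.8);

// Fit on the training split
$model = new LogisticRegression();
$model->train($dataset->train());

// Predict on the held-out test split
$predictions = $model->predict($dataset->test());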

🔬 Deep Dive: Zero-Copy + Cache Layout

🔧 Internal C Layer Walkthrough

Tensor Struct (Conceptual)

typedef struct {
    float* data;     // contiguous memory
    int* shape;      // dimensions
    int ndim;        // number of dimensions
    int size;        // total elements
} Tensor;

Example: In-place Sigmoid

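#include <math.h>   // expf

// Applies the logistic function to every element, overwriting the input.
void tensor_sigmoid_inplace(Tensor* t) {
    for (int i = 0; i < t->size; i++) {
        float x = t->data[i];
        t->data[i] = 1.0f / (1.0f + expf(-x));
    }
}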

➡️ No allocation. Direct memory mutation.

Example: FFI Binding (PHP)

$ffi->tensor_sigmoid_inplace($tensor);

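➡️ PHP calls the C function directly through FFI; the only added cost is the per-call FFI overhead of a few microseconds reported in the benchmarks above.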

Memory Layout Insight

Contiguous Block:
[x1 x2 x3 x4 x5 ...]

➡️ Enables:

SIMD vector loads
Cache line efficiency
Prefetch-friendly execution

🧠 Problem

Traditional PHP ML:

Arrays = scattered memory
Copy-heavy pipelines
Cache misses → slow execution

⚡ Solution (PML)

1. Zero-Copy Design

Data passed by reference across layers
No duplication between PHP ↔ C
Batch slicing = pointer offsets only
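
To make "pointer offsets only" concrete, the sketch below builds a batch view over a row-major sample matrix. The `BatchView` type and `batch_slice` function are hypothetical names invented for this example. The point is only that the view stores an offset pointer into the parent buffer and copies no elements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 2-D view type for illustration; not PML's actual struct. */
typedef struct {
    float *data;   /* borrowed pointer into the parent buffer (not owned) */
    size_t rows;   /* number of samples in this view */
    size_t cols;   /* features per sample */
} BatchView;

/* Return a view of rows [start, start + count) of a row-major matrix.
 * No elements are copied: the view only stores an offset pointer. */
BatchView batch_slice(float *base, size_t total_rows, size_t cols,
                      size_t start, size_t count) {
    assert(start <= total_rows && count <= total_rows - start);
    BatchView v = { base + start * cols, count, cols };
    return v;
}
```

Because the view aliases the parent buffer, writes through `v.data` are visible in the original dataset, and the parent must outlive every view taken from it.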

2. Cache-Friendly Layout

[B × D × T × N]

B = Batch
D = Features / Embedding
T = Time / Sequence
N = Head / Channel

➡️ Ensures sequential memory access, maximizing CPU cache hits.
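
The sketch below shows how a flat offset into a contiguous [B × D × T × N] buffer could be computed. It assumes row-major order with N as the fastest-varying index, which the text above does not state explicitly. The helper names `bdtn_offset` and `sum_sequential` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Flat offset of element (b, d, t, n) in a contiguous [B x D x T x N] buffer.
 * Assumption: row-major order, so n is the fastest-varying index. */
size_t bdtn_offset(size_t D, size_t T, size_t N,
                   size_t b, size_t d, size_t t, size_t n) {
    assert(d < D && t < T && n < N);
    return ((b * D + d) * T + t) * N + n;
}

/* Sum every element with n as the innermost loop: the offset grows by
 * exactly 1 per step, so memory is read sequentially. */
float sum_sequential(const float *x, size_t B, size_t D, size_t T, size_t N) {
    float acc = 0.0f;
    for (size_t b = 0; b < B; b++)
        for (size_t d = 0; d < D; d++)
            for (size_t t = 0; t < T; t++)
                for (size_t n = 0; n < N; n++)
                    acc += x[bdtn_offset(D, T, N, b, d, t, n)];
    return acc;
}
```

Nesting the loops in the same order as the dimensions is what turns the layout into sequential access. Iterating with `b` innermost instead would jump D × T × N elements per step.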

3. In-place Operations

x = sigmoid(x)   // no new allocation

➡️ Reduces memory churn + improves throughput.

4. Fused Kernels

loss + gradient → single pass

➡️ Cuts memory bandwidth usage drastically.

📦 Advanced Capabilities

🔁 Zero-copy batch pipelines
⚡ Fused kernels (loss + gradient)
🧵 Parallel tensor ops (OpenMP)
🧠 Cache-optimized layouts for sequence models
📉 Numerical stability (softmax, log, etc.)

⚠️ Known Issues

3 failing assertions in benchmark suite
High variance in some SIMD benchmarks (expected due to CPU scheduling)

🛣️ Roadmap

🔜 Short Term

Fix remaining 3 failing assertions
Reduce run-to-run variance in SIMD benchmarks
Expand dataset streaming (GB-scale)

🚀 Mid Term

JIT kernel fusion engine
Memory pool allocator
Advanced optimizers (AdamW, RMSProp)

🌌 Long Term

GPU backend (CUDA / OpenCL)
Transformer / LLM primitives
Distributed training (multi-node)
ONNX import/export

🤝 Contributing

Pull requests are welcome. For major changes, please open an issue first.

📄 Whitepaper

A research-style deep dive is available in whitepaper.md. It covers:

HPC design philosophy in PHP
Zero-copy architecture analysis
Benchmark methodology
SIMD + OpenMP strategies
Comparison with Python ML stack

📄 License

MIT License

💡 Final Thought

"PHP was never meant for HPC… until now."

PML pushes PHP beyond its limits — into the domain of high-performance machine learning systems.

🔥 If you like this project, give it a star and push PHP further!

ghostjat/pml
