ghostjat/dna
Composer install command:
composer require ghostjat/dna
Package description
Description of project DNA.
README
Build a multi-class DNA sequence classifier in pure PHP using PHP-ML — from raw data to predictions.
📌 Introduction
This tutorial demonstrates how to use the PHP-ML library to build a machine learning model that classifies DNA sequences into:
- 🦠 Bacteria
- 🐾 Animal
- 🍄 Fungi
- 🧫 Virus
- 🌿 Plant
You’ll go through the complete pipeline:
- Data preparation
- Exploratory Data Analysis (EDA)
- Model training
- Evaluation
- Prediction
All examples are located in:
example/dna/
├── eda.php
├── train.php
└── predict.php
🧪 Problem Overview
DNA sequences contain patterns that can be used to identify their biological origin. Instead of binary promoter detection, this project performs multi-class classification across five organism types.
🤖 Why Machine Learning?
Machine learning helps by:
- Automatically discovering patterns in DNA sequences
- Scaling to large biological datasets
- Classifying quickly and accurately (this example reports about 90% validation accuracy after roughly 20 seconds of training)
⚙️ Prerequisites
Ensure you have:
- PHP ≥ 8.2
- Composer
- PHP-ML, installed with the command below (the constraint is quoted so the shell does not try to expand the `*`):
composer require "ghostjat/pml:*"
- Basic command-line knowledge
📂 Dataset Overview
📊 Summary
- Total Samples: 244,447
- Features: 256 (k-mer frequencies)
- Classes: 5
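The README does not say how the 256 features were derived. Since 4^4 = 256, one plausible reading is that each feature is the relative frequency of one 4-mer over the alphabet A, C, G, T. The sketch below works under that assumption; the function names `allKmers` and `kmerFrequencies` are hypothetical and not part of PHP-ML, and the column order in the real dataset may differ.

```php
<?php
declare(strict_types=1);

/**
 * Build the list of all k-mers over A, C, G, T in lexicographic order.
 * For k = 4 this yields 4^4 = 256 entries.
 */
function allKmers(int $k): array
{
    $kmers = [''];
    for ($i = 0; $i < $k; $i++) {
        $next = [];
        foreach ($kmers as $prefix) {
            foreach (['A', 'C', 'G', 'T'] as $base) {
                $next[] = $prefix . $base;
            }
        }
        $kmers = $next;
    }
    return $kmers;
}

/**
 * Turn a DNA string into a fixed-length vector of relative k-mer frequencies.
 * Windows containing characters other than A/C/G/T (e.g. N) are skipped.
 */
function kmerFrequencies(string $sequence, int $k = 4): array
{
    $counts = array_fill_keys(allKmers($k), 0);
    $sequence = strtoupper($sequence);
    $total = 0;
    for ($i = 0, $last = strlen($sequence) - $k; $i <= $last; $i++) {
        $kmer = substr($sequence, $i, $k);
        if (isset($counts[$kmer])) {
            $counts[$kmer]++;
            $total++;
        }
    }
    // Normalise counts to frequencies; an all-zero vector stays all zero.
    return array_map(
        static fn (int $c): float => $total > 0 ? $c / $total : 0.0,
        array_values($counts)
    );
}

// Hypothetical input sequence, used only to show the output shape.
$features = kmerFrequencies('ACGTACGTAC');
printf("%d features, sum = %.2f\n", count($features), array_sum($features));
```

Normalising by the number of valid windows keeps sequences of different lengths comparable, which is one reason frequencies are often preferred over raw counts.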
🧬 Classes
- bacteria
- animal
- fungi
- virus
- plant
📁 Storage
datasets/train_*.csv
🔍 Step 1: Exploratory Data Analysis (eda.php)
This script loads and inspects the dataset.
$trainFiles = glob(__DIR__ . '/datasets/train_*.csv');
$dataset = loadDna($trainFiles[0]);
for ($i = 1; $i < count($trainFiles); $i++) {
    $dataset = $dataset->stack(loadDna($trainFiles[$i]));
}
$df0 = DataFrame::fromCSV($trainFiles[0], false);
$cols0 = $df0->columns();
$classes = $df0->categories(end($cols0));
🔎 What it does
- Loads multiple CSV files
- Merges them into one dataset
- Extracts class distribution
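To make the class-distribution step concrete without relying on the library, here is a minimal sketch in plain PHP that counts labels in the last column of a CSV file. It assumes the files have no header row and that the label is the final field; `classDistribution` is a hypothetical helper, not something `eda.php` is known to contain.

```php
<?php
declare(strict_types=1);

/**
 * Count how many rows carry each label in the last column of a CSV file.
 * Assumes there is no header row and that the label is the final field.
 */
function classDistribution(string $csvPath): array
{
    $handle = fopen($csvPath, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $csvPath");
    }
    $counts = [];
    while (($row = fgetcsv($handle, null, ',', '"', '\\')) !== false) {
        if ($row === [null]) {
            continue; // fgetcsv() reports a blank line as [null]
        }
        $label = (string) end($row);
        $counts[$label] = ($counts[$label] ?? 0) + 1;
    }
    fclose($handle);
    arsort($counts); // most frequent class first
    return $counts;
}

// Demo on a tiny temporary file (made-up rows: two features plus a label).
$tmp = tempnam(sys_get_temp_dir(), 'dna');
file_put_contents($tmp, "0.1,0.9,virus\n0.4,0.6,plant\n0.2,0.8,virus\n");
$distribution = classDistribution($tmp);
unlink($tmp);

foreach ($distribution as $label => $count) {
    printf("%-8s %d\n", $label, $count);
}
```

A strongly skewed distribution would be worth knowing about before training, because overall accuracy can look good while a rare class is mostly misclassified.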
🧠 Step 2: Training & Evaluation (train.php)
Train a neural network using MLPClassifier.
$pipeline = new Pipeline(
    [new NumericStringConverter(), new ZScaleStandardizer()],
    new MLPClassifier(
        architecture: [32, 16],
        epochs: 10,
        learningRate: 0.01,
        batchSize: 32
    )
);

Dataset::seed(42);
$dataset->randomize();
[$train, $val] = $dataset->split(0.8);

$pipeline->train($train);

$valPreds = $pipeline->predict($val);
$valAcc = (new Accuracy())->score($valPreds, $val->labels());
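The pipeline standardizes features before they reach the network. As a rough illustration of what z-score standardization does, the plain-PHP sketch below rescales each column to zero mean and unit variance. It is not the library's `ZScaleStandardizer`: the function name `zScore` is hypothetical, and the use of the population standard deviation and the handling of constant columns are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Column-wise z-score: (x - mean) / std, with mean and std taken from $rows.
 * $rows must be a non-empty list of equally long lists of numbers or numeric strings.
 * Constant columns (std = 0) are mapped to 0.0 to avoid division by zero.
 */
function zScore(array $rows): array
{
    $n = count($rows);
    $numCols = count($rows[0]);
    $out = $rows;
    for ($j = 0; $j < $numCols; $j++) {
        // floatval() also converts numeric strings read from a CSV file.
        $column = array_map('floatval', array_column($rows, $j));
        $mean = array_sum($column) / $n;
        $variance = 0.0;
        foreach ($column as $x) {
            $variance += ($x - $mean) ** 2;
        }
        $std = sqrt($variance / $n); // population standard deviation
        foreach ($column as $i => $x) {
            $out[$i][$j] = $std > 0.0 ? ($x - $mean) / $std : 0.0;
        }
    }
    return $out;
}

// Made-up rows: the first column varies, the second is constant.
$scaled = zScore([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]);
print_r($scaled);
```

In a real pipeline the mean and standard deviation should be computed on the training split only and then reused for validation and prediction data.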
⚡ Training Details
- Train Samples: 195,558
- Validation Samples: 48,889
- Validation Accuracy: ~90.07%
- Training Time: ~20 seconds
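The sample counts follow from the 0.8 split ratio: 0.8 × 244,447 = 195,557.6, which gives 195,558 training rows and 48,889 validation rows if the cut point is rounded. The sketch below reproduces that arithmetic with a seeded shuffle in plain PHP. The rounding rule and the helper name `seededSplit` are assumptions, and the shuffle order will not match what `Dataset::seed(42)` produces in the library.

```php
<?php
declare(strict_types=1);

/**
 * Shuffle rows with a fixed seed and cut them into [train, validation].
 * The cut point is rounded to the nearest row, which is an assumption.
 */
function seededSplit(array $rows, float $ratio, int $seed): array
{
    mt_srand($seed); // seeds the Mersenne Twister that shuffle() draws from
    shuffle($rows);
    $cut = (int) round(count($rows) * $ratio);
    return [array_slice($rows, 0, $cut), array_slice($rows, $cut)];
}

// Row numbers stand in for real samples here.
[$train, $val] = seededSplit(range(1, 244447), 0.8, 42);
printf("train=%d val=%d\n", count($train), count($val));
```

Fixing the seed makes the split repeatable, so accuracy figures from different runs can be compared on the same validation rows.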
🔮 Step 3: Prediction (predict.php)
Use a trained model to classify new DNA sequences.
// ── 1. Load model + class map ─────────────────────────────────────────────────
$logger->info('Loading model …');
$pipeline = Pipeline::load($modelDir);
$classes = json_decode(file_get_contents($modelDir . '/classes.json'), true);
$logger->info('Model loaded', ['classes' => $classes]);

// ── 2. Load unknown CSV ───────────────────────────────────────────────────────
$logger->info('Loading unknown data …');
$df = DataFrame::fromCSV($unknownCsv, false);
$cols = $df->columns();

// Check if last col is a label (STRING) or a feature (float32)
$dtypes = $df->dtypes();
$lastCol = end($cols);
$hasLabels = ($dtypes[$lastCol] === 'string');

$X = $df->drop($hasLabels ? [$lastCol] : [])->toTensor();
$dataset = new Dataset($X);
$logger->info('Data ready', ['rows' => $dataset->numRows(), 'features' => $dataset->numColumns()]);

// ── 3. Predict ────────────────────────────────────────────────────────────────
$logger->info('Predicting …');
$predIndices = $pipeline->predict($dataset)->toFlatArray(); // [N] class indices

// ── 4. Evaluate if labels available ──────────────────────────────────────────
if ($hasLabels) {
    $yTrue = $df->castToFloat($lastCol)->col($lastCol)->squeeze();
    $predT = \Pml\Tensor::fromArray($predIndices);
    $acc = (new Accuracy())->score($predT, $yTrue);
    $logger->info(sprintf('Test accuracy: %.4f (%.2f%%)', $acc, $acc * 100));
}
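The script yields class indices rather than names. A minimal sketch of turning those indices back into readable labels follows, assuming that `classes.json` decodes to a list in which position i holds the name of class i. The actual file layout is not shown in this README, so the JSON string, the class order, and the helper `labelsFor` below are all hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Translate predicted class indices into class names.
 * Indices missing from the map are reported as "unknown".
 */
function labelsFor(array $predIndices, array $classes): array
{
    return array_map(
        // Cast first: a tensor flattened to an array may hold floats such as 4.0.
        static fn ($index): string => $classes[(int) $index] ?? 'unknown',
        $predIndices
    );
}

// Hypothetical contents of classes.json and a hypothetical prediction vector.
$classes = json_decode('["animal","bacteria","fungi","plant","virus"]', true);
$predIndices = [4.0, 0.0, 3.0, 4.0];

$labels = labelsFor($predIndices, $classes);
print_r(array_count_values($labels)); // how often each class was predicted
```

Storing the class map next to the model matters because the index-to-name order is fixed at training time and cannot be recovered from the predictions alone.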
▶️ Running the Example
🔍 EDA
php eda.php
🧠 Training
php train.php      # softmax
php trainMLP.php
🔮 Prediction
php predict.php
📊 Interpreting Results
- Accuracy → Overall correctness
- Multi-class Predictions → Output label among 5 classes
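Overall accuracy can hide weak classes, so it may help to look at per-class figures as well. The sketch below computes accuracy and per-class recall in plain PHP. Per-class recall is not something the example scripts are stated to report, and both function names are hypothetical.

```php
<?php
declare(strict_types=1);

/** Fraction of positions where prediction and truth agree. */
function accuracy(array $predicted, array $actual): float
{
    if (count($predicted) !== count($actual) || $actual === []) {
        throw new InvalidArgumentException('Inputs must be non-empty and equally long.');
    }
    $hits = 0;
    foreach ($predicted as $i => $label) {
        if ($label === $actual[$i]) {
            $hits++;
        }
    }
    return $hits / count($actual);
}

/** Per-class recall: of the samples truly in a class, the share predicted correctly. */
function perClassRecall(array $predicted, array $actual): array
{
    $totals = [];
    $hits = [];
    foreach ($actual as $i => $label) {
        $totals[$label] = ($totals[$label] ?? 0) + 1;
        $hits[$label] = ($hits[$label] ?? 0) + (int) ($predicted[$i] === $label);
    }
    $recall = [];
    foreach ($totals as $label => $total) {
        $recall[$label] = $hits[$label] / $total;
    }
    return $recall;
}

// Made-up predictions and ground truth for four samples.
$predicted = ['virus', 'plant', 'virus', 'animal'];
$actual    = ['virus', 'plant', 'plant', 'animal'];

printf("accuracy: %.2f\n", accuracy($predicted, $actual));
foreach (perClassRecall($predicted, $actual) as $label => $value) {
    printf("%-8s recall %.2f\n", $label, $value);
}
```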
🚀 Extending the Tutorial
- Increase the number of epochs beyond the 10 used here, which may improve accuracy
- Try deeper or wider architectures than [32, 16]
- Experiment with other classifiers
- Add cross-validation
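For the cross-validation suggestion, one possible starting point is a helper that produces k train/validation index splits, which can then be used to select rows before training a fresh pipeline per fold. This is a sketch in plain PHP; `kFoldIndices` is a hypothetical function, and whether PHP-ML ships its own cross-validator is not covered by this README.

```php
<?php
declare(strict_types=1);

/**
 * Split row indices 0..$n-1 into $k folds of near-equal size.
 * Returns a list of [trainIndices, validationIndices] pairs.
 */
function kFoldIndices(int $n, int $k, int $seed): array
{
    if ($k < 2 || $k > $n) {
        throw new InvalidArgumentException('k must be between 2 and n.');
    }
    $indices = range(0, $n - 1);
    mt_srand($seed);
    shuffle($indices);

    // Deal indices round-robin so fold sizes differ by at most one.
    $folds = array_fill(0, $k, []);
    foreach ($indices as $position => $index) {
        $folds[$position % $k][] = $index;
    }

    $splits = [];
    for ($f = 0; $f < $k; $f++) {
        $train = [];
        foreach ($folds as $g => $fold) {
            if ($g !== $f) {
                $train = array_merge($train, $fold);
            }
        }
        $splits[] = [$train, $folds[$f]];
    }
    return $splits;
}

foreach (kFoldIndices(10, 5, 42) as $fold => [$trainIdx, $valIdx]) {
    printf("fold %d: %d train, %d validation\n", $fold, count($trainIdx), count($valIdx));
}
```

Averaging the validation accuracy over all folds gives a steadier estimate than the single 80/20 split, at the cost of training the model k times.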
🏁 Conclusion
You now have a complete workflow for building a multi-class DNA classifier in PHP.
❤️ Final Note
Push PHP beyond traditional limits — even into machine learning.
Happy coding! 🚀
Other information
- License: MIT
- Last updated: 2026-04-30