ghostjat/dna

Composer install command:

composer require ghostjat/dna

Package description

Description of project DNA.

README

Build a multi-class DNA sequence classifier in pure PHP using PHP-ML — from raw data to predictions.

📌 Introduction

This tutorial demonstrates how to use the PHP-ML library to build a machine learning model that classifies DNA sequences into:

  • 🦠 Bacteria
  • 🐾 Animal
  • 🍄 Fungi
  • 🧫 Virus
  • 🌿 Plant

You’ll go through the complete pipeline:

  • Data preparation
  • Exploratory Data Analysis (EDA)
  • Model training
  • Evaluation
  • Prediction

All examples are located in:

example/dna/
├── eda.php
├── train.php
└── predict.php

🧪 Problem Overview

DNA sequences contain patterns that can be used to identify their biological origin. Instead of binary promoter detection, this project performs multi-class classification across five organism types.

🤖 Why Machine Learning?

Machine learning helps by:

  • Automatically discovering patterns in DNA sequences
  • Scaling to large biological datasets
  • Providing fast and accurate classification

⚙️ Prerequisites

Ensure you have:

  • PHP ≥ 8.2
  • Composer
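  • The PHP-ML library (Composer package ghostjat/pml), installed with the command below; the quotes keep the shell from expanding the asterisk:
composer require "ghostjat/pml:*"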
  • Basic command-line knowledge

📂 Dataset Overview

📊 Summary

  • Total Samples: 244,447
  • Features: 256 k-mer frequencies per sequence (256 = 4^4, which is consistent with k = 4 over the bases A, C, G, T)
  • Classes: 5
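
The README does not show how the 256 features are produced. As a minimal sketch, assuming they are normalized 4-mer frequencies ordered lexicographically over A, C, G, T, a sequence could be turned into a feature vector like this. The function name kmerFrequencies and the handling of ambiguous bases are illustrative assumptions, not part of the package.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the package): convert a DNA string into
 * 4^k normalized k-mer frequencies, ordered lexicographically over A, C, G, T.
 */
function kmerFrequencies(string $sequence, int $k = 4): array
{
    $alphabet = ['A' => 0, 'C' => 1, 'G' => 2, 'T' => 3];
    $counts   = array_fill(0, 4 ** $k, 0);
    $sequence = strtoupper($sequence);
    $total    = 0;

    // Slide a window of length k over the sequence.
    for ($i = 0, $last = strlen($sequence) - $k; $i <= $last; $i++) {
        $index = 0;
        for ($j = 0; $j < $k; $j++) {
            $base = $sequence[$i + $j];
            if (!isset($alphabet[$base])) {
                continue 2; // assumption: skip windows containing N or other symbols
            }
            // Treat the k-mer as a base-4 number to get its slot.
            $index = $index * 4 + $alphabet[$base];
        }
        $counts[$index]++;
        $total++;
    }

    if ($total === 0) {
        return array_map('floatval', $counts); // no valid window: all zeros
    }

    return array_map(static fn (int $c): float => $c / $total, $counts);
}

$features = kmerFrequencies('ACGTACGTAC');
echo count($features), PHP_EOL; // prints 256
```

Each row of the training CSVs would then hold one such vector followed by the class label, if the dataset was built this way.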

🧬 Classes

  • bacteria
  • animal
  • fungi
  • virus
  • plant

📁 Storage

datasets/train_*.csv

🔍 Step 1: Exploratory Data Analysis (eda.php)

This script loads and inspects the dataset.

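// Collect every training shard; glob() returns false on error, so fall back to [].
$trainFiles = glob(__DIR__ . '/datasets/train_*.csv') ?: [];
if ($trainFiles === []) {
    exit("No training files found in datasets/\n");
}

// Load the first file, then stack each remaining file onto it.
$dataset = loadDna($trainFiles[0]);
foreach (array_slice($trainFiles, 1) as $file) {
    $dataset = $dataset->stack(loadDna($file));
}

// Read the class names from the label column (the last column) of the first file.
$df0 = DataFrame::fromCSV($trainFiles[0], false);
$cols0 = $df0->columns();
$classes = $df0->categories(end($cols0));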

🔎 What it does

  • Loads multiple CSV files
  • Merges them into one dataset
  • Extracts class distribution
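
To make "class distribution" concrete, here is a library-free sketch that counts how many rows carry each label. It assumes the label is the last CSV column and that the files have no header row; classDistribution is a hypothetical helper and does not use the library's DataFrame.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: count rows per label, where the label is assumed
 * to be the last column of each CSV file.
 */
function classDistribution(array $csvFiles): array
{
    $counts = [];
    foreach ($csvFiles as $file) {
        $handle = fopen($file, 'r');
        if ($handle === false) {
            throw new RuntimeException("Cannot open $file");
        }
        while (($row = fgetcsv($handle, null, ',', '"', '\\')) !== false) {
            if ($row === [null]) {
                continue; // fgetcsv() reports a blank line as [null]
            }
            $label = (string) end($row);
            $counts[$label] = ($counts[$label] ?? 0) + 1;
        }
        fclose($handle);
    }
    ksort($counts);

    return $counts;
}

// Same file pattern as the tutorial; prints nothing if no files match.
$files = glob(__DIR__ . '/datasets/train_*.csv') ?: [];
foreach (classDistribution($files) as $label => $count) {
    printf("%-10s %d\n", $label, $count);
}
```

A strongly imbalanced distribution would be a reason to look beyond plain accuracy when evaluating the model.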

🧠 Step 2: Training & Evaluation (train.php)

Train a neural network using MLPClassifier.

$pipeline = new Pipeline(
    [new NumericStringConverter(), new ZScaleStandardizer()],
    new MLPClassifier(
        architecture: [32, 16],
        epochs: 10,
        learningRate: 0.01,
        batchSize: 32
    )
);

Dataset::seed(42);
$dataset->randomize();

[$train, $val] = $dataset->split(0.8);

$pipeline->train($train);

$valPreds = $pipeline->predict($val);
$valAcc = (new Accuracy())->score($valPreds, $val->labels());
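
The pipeline standardizes the features before they reach the network. As a rough illustration of what z-score standardization does, the sketch below fits a mean and standard deviation per column on the training rows and applies them to any row set. fitZScore and applyZScore are hypothetical names, and the library's ZScaleStandardizer may differ in detail, for example in how it treats constant columns.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: per-column mean and (population) standard deviation. */
function fitZScore(array $rows): array
{
    $n    = count($rows);
    $cols = count($rows[0]);
    $mean = array_fill(0, $cols, 0.0);
    $std  = array_fill(0, $cols, 0.0);

    foreach ($rows as $row) {
        foreach ($row as $j => $value) {
            $mean[$j] += $value / $n;
        }
    }
    foreach ($rows as $row) {
        foreach ($row as $j => $value) {
            $std[$j] += ($value - $mean[$j]) ** 2 / $n;
        }
    }
    foreach ($std as $j => $variance) {
        // Assumption: constant columns get a divisor of 1 to avoid division by zero.
        $std[$j] = sqrt($variance) ?: 1.0;
    }

    return ['mean' => $mean, 'std' => $std];
}

/** Hypothetical helper: apply previously fitted statistics to each row. */
function applyZScore(array $rows, array $stats): array
{
    return array_map(
        static fn (array $row): array => array_map(
            static fn (float $v, float $m, float $s): float => ($v - $m) / $s,
            $row,
            $stats['mean'],
            $stats['std']
        ),
        $rows
    );
}

$trainRows = [[1.0, 10.0], [3.0, 10.0]];
$stats     = fitZScore($trainRows);
$scaled    = applyZScore($trainRows, $stats);
print_r($scaled);
```

The statistics are fitted on the training rows only and then reused for validation data, so no information from the validation set leaks into training.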

⚡ Training Details

  • Train Samples: 195,558
  • Validation Samples: 48,889
  • Validation Accuracy: ~90.07%
  • Training Time: ~20 seconds
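
The sample counts above follow from the 0.8 split of 244,447 rows. The sketch below reproduces that arithmetic with a seeded shuffle and shows how accuracy is computed from predictions and labels. splitIndices and accuracy are hypothetical helpers; rounding to the nearest integer is an assumption that happens to match the reported counts, and the shuffle order will not match the library's Dataset::seed(42).

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: shuffle indices reproducibly and split them at $ratio. */
function splitIndices(int $n, float $ratio, int $seed): array
{
    $indices = range(0, $n - 1);
    mt_srand($seed);   // shuffle() draws from the seeded Mersenne Twister
    shuffle($indices);
    $cut = (int) round($n * $ratio);

    return [array_slice($indices, 0, $cut), array_slice($indices, $cut)];
}

/** Hypothetical helper: fraction of positions where prediction and label agree. */
function accuracy(array $predicted, array $actual): float
{
    if ($predicted === [] || count($predicted) !== count($actual)) {
        throw new InvalidArgumentException('Inputs must be non-empty and equal in length.');
    }
    $correct = 0;
    foreach ($predicted as $i => $label) {
        if ($label === $actual[$i]) {
            $correct++;
        }
    }

    return $correct / count($predicted);
}

[$trainIdx, $valIdx] = splitIndices(244447, 0.8, 42);
printf("%d train, %d validation\n", count($trainIdx), count($valIdx));
// prints "195558 train, 48889 validation"
```

With these definitions, a validation accuracy of about 90.07% corresponds to roughly 44,000 of the 48,889 validation samples being classified correctly.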

🔮 Step 3: Prediction (predict.php)

Use a trained model to classify new DNA sequences.

// ── 1. Load model + class map ─────────────────────────────────────────────────
$logger->info('Loading model …');
$pipeline = Pipeline::load($modelDir);
$classes  = json_decode(file_get_contents($modelDir . '/classes.json'), true);
$logger->info('Model loaded', ['classes' => $classes]);

// ── 2. Load unknown CSV ───────────────────────────────────────────────────────
$logger->info('Loading unknown data …');
$df   = DataFrame::fromCSV($unknownCsv, false);
$cols = $df->columns();

// Check if last col is a label (STRING) or a feature (float32)
$dtypes      = $df->dtypes();
$lastCol     = end($cols);
$hasLabels   = ($dtypes[$lastCol] === 'string');

$X       = $df->drop($hasLabels ? [$lastCol] : [])->toTensor();
$dataset = new Dataset($X);
$logger->info('Data ready', ['rows' => $dataset->numRows(), 'features' => $dataset->numColumns()]);

// ── 3. Predict ────────────────────────────────────────────────────────────────
$logger->info('Predicting …');
$predIndices = $pipeline->predict($dataset)->toFlatArray();   // [N] class indices

// ── 4. Evaluate if labels available ──────────────────────────────────────────
if ($hasLabels) {
    $yTrue = $df->castToFloat($lastCol)->col($lastCol)->squeeze();
    $predT = \Pml\Tensor::fromArray($predIndices);
    $acc   = (new Accuracy())->score($predT, $yTrue);
    $logger->info(sprintf('Test accuracy: %.4f  (%.2f%%)', $acc, $acc * 100));
}
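
The script above loads classes.json but the excerpt stops at class indices. A minimal sketch of turning those indices into readable labels follows, assuming classes.json decodes to a plain list in which the position is the class index. Both that layout and the helper name indicesToLabels are assumptions; adjust the lookup if the file uses another structure.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: translate predicted class indices into class names.
 * Assumes $classes is a list where position equals class index.
 */
function indicesToLabels(array $predIndices, array $classes): array
{
    return array_map(
        static fn (int|float $i): string => $classes[(int) $i] ?? 'unknown',
        $predIndices
    );
}

// Hypothetical class order, for illustration only.
$classes = ['animal', 'bacteria', 'fungi', 'plant', 'virus'];
$labels  = indicesToLabels([4, 0, 2], $classes);
echo implode(', ', $labels), PHP_EOL; // prints "virus, animal, fungi"
```

In predict.php the same call could be applied to $predIndices and $classes once the layout of classes.json has been confirmed.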

▶️ Running the Example

🔍 EDA

php eda.php

🧠 Training

php train.php      # softmax

php trainMLP.php

🔮 Prediction

php predict.php

📊 Interpreting Results

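  • Accuracy → Fraction of samples whose predicted class matches the true label (about 0.90 on the validation split above)
  • Multi-class Predictions → Each prediction is a class index that maps to one of the 5 classes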
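
A single accuracy figure can hide classes that are systematically confused with each other. The tutorial reports accuracy only; as a hedged extension, the sketch below builds a confusion matrix from true and predicted class indices and derives per-class recall. confusionMatrix and perClassRecall are hypothetical helpers, and the sample arrays are made-up toy data, not results from this model.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: $matrix[$actual][$predicted] = number of samples. */
function confusionMatrix(array $actual, array $predicted, int $numClasses): array
{
    $matrix = array_fill(0, $numClasses, array_fill(0, $numClasses, 0));
    foreach ($actual as $i => $trueClass) {
        $matrix[$trueClass][$predicted[$i]]++;
    }

    return $matrix;
}

/** Hypothetical helper: correct predictions divided by true members, per class. */
function perClassRecall(array $matrix): array
{
    $recall = [];
    foreach ($matrix as $class => $row) {
        $total = array_sum($row);
        // null marks a class with no true samples in the evaluated set.
        $recall[$class] = $total > 0 ? $row[$class] / (float) $total : null;
    }

    return $recall;
}

// Toy data with three classes, for illustration only.
$actual    = [0, 0, 1, 1, 2, 2];
$predicted = [0, 1, 1, 1, 2, 0];
$matrix    = confusionMatrix($actual, $predicted, 3);
print_r(perClassRecall($matrix));
```

For the five organism classes, the same functions would be called with the true and predicted index arrays and $numClasses set to 5.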

🚀 Extending the Tutorial

  • Increase epochs (the example trains for only 10) and check whether validation accuracy keeps improving
  • Try deeper architectures
  • Experiment with other classifiers
  • Add cross-validation
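
For the cross-validation idea, one possible starting point is to generate the fold indices in plain PHP and then train and score the pipeline once per fold. The sketch below covers only the index bookkeeping; kFoldIndices is a hypothetical helper, and how rows are selected from the library's Dataset by index is left open because that API is not shown in this README.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: partition indices 0..n-1 into k folds of near-equal
 * size. Returns a list of [trainIndices, validationIndices] pairs.
 */
function kFoldIndices(int $n, int $k, int $seed): array
{
    if ($k < 2 || $k > $n) {
        throw new InvalidArgumentException('k must be between 2 and n.');
    }
    $indices = range(0, $n - 1);
    mt_srand($seed);   // reproducible shuffle
    shuffle($indices);

    $folds = [];
    for ($f = 0; $f < $k; $f++) {
        // Integer boundaries spread any remainder evenly across folds.
        $start = intdiv($f * $n, $k);
        $end   = intdiv(($f + 1) * $n, $k);
        $val   = array_slice($indices, $start, $end - $start);
        $train = array_merge(
            array_slice($indices, 0, $start),
            array_slice($indices, $end)
        );
        $folds[] = [$train, $val];
    }

    return $folds;
}

foreach (kFoldIndices(10, 5, 42) as $fold => [$train, $val]) {
    printf("fold %d: %d train, %d validation\n", $fold, count($train), count($val));
}
```

Averaging the per-fold validation accuracy gives a steadier estimate than the single 80/20 split used in train.php, at the cost of training the model k times.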

🏁 Conclusion

You now have a complete workflow for building a multi-class DNA classifier in PHP.

❤️ Final Note

Push PHP beyond traditional limits — even into machine learning.

Happy coding! 🚀

Other information

  • License: MIT
  • Last updated: 2026-04-30
