Package overview

Tools to extract basic content inventory information from an existing website

README

This TYPO3 Flow package provides a CLI tool to extract a content inventory CSV from an existing website.

This package is under development and is considered beta. It requires Flow 2.3.

Features

Extract website structure and basic metadata
Support crawling presets
Flexible report building (includes a CSV report builder, but you can register your own report builder)
Skip URIs with regular expressions
Sort inventory based on document tree structure

Todos

Generate human-readable page IDs (such as 1, 1.1, 1.2, 2, 2.1, 2.2, ...)
Update report / multiple index support
Get analytics data from Google Analytics

Configuration

See Configuration/Settings.yaml for the detailed configuration.

By default, this package caches all raw HTTP requests for one day. You can change this setting in your own Settings.yaml and Caches.yaml.

Base Preset

The base preset is automatically merged with all presets. You can enable or disable any property with the setting presets.[preset_name].properties.[property_name].enabled.

Ttree:
  ContentInsight:
    presets:
      '*':
        properties:
          'pageTitle':
            enabled: TRUE
          'navigationTitle':
            enabled: TRUE
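
As a sketch of the enabled switch described above, the fragment below turns off navigationTitle for a single preset while leaving the base preset untouched. The preset name custom is only an illustration, and it is an assumption here that a preset-level value overrides the value merged in from the base preset.

```yaml
Ttree:
  ContentInsight:
    presets:
      'custom':
        properties:
          'navigationTitle':
            # Assumption: overrides the value inherited from the '*' base preset
            enabled: FALSE
```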

Custom Preset

You can define custom presets to crawl different kinds of information. With the class setting you can use your own processor implementation to extract information from the current URI. Your processor must implement Ttree\ContentInsight\CrawlerProcessor\ProcessorInterface:

Ttree:
  ContentInsight:
    presets:
      'custom':
        properties:
          'pageTitle':
            class: 'Your\Package\CrawlerProcessor\PageTitleProcessor'
          'metaDescription':
            enabled: TRUE
          'metaKeywords':
            enabled: TRUE
          'firstLevelHeader':
            enabled: TRUE

How to build a report?

The package supports CSV reporting, but you can register your own report builder. Check the Settings.yaml:

Ttree:
  ContentInsight:
    presets:
      'custom':
        reportConfigurations:
          'csv':
            enabled: TRUE
            renderType: 'Csv'
            renderTypeOptions:
              displayColumnHeaders: TRUE
            reportPath: '%FLOW_PATH_DATA%Reports/Ttree.ContentInsight'
            reportPrefix: 'content-inventory-report'
            properties:
              'id':
                label: 'ID'
              'pageTitle':
                label: 'Page Title'
              'navigationTitle':
                label: 'Navigation Title'
              'externalLink':
                label: 'External Link'
                postProcessor: 'Boolean'
              'currentUri':
                label: 'URL'
              'metaDescription':
                label: 'Meta Description'
              'metaKeywords':
                label: 'Meta Keywords'
              'firstLevelHeaderCount':
                label: 'Main Header Count (H1)'
              'firstLevelHeaderContent':
                label: 'Main Header Content (H1)'
              'remark':
                label: 'Crawling Remark'

The keys in the properties section must match the keys produced by the CrawlerProcessor object.

The position of each column can be specified with the syntax position: '<position-string>'. The <position-string> supports one of the following forms:

    start (<weight>)
    end (<weight>)
    before <key> (<weight>)
    after <key> (<weight>)
    <numerical-order>

Example

Ttree:
  ContentInsight:
    presets:
      'custom':
        reportConfigurations:
          'csv':
            enabled: TRUE
            renderType: 'Csv'
            renderTypeOptions:
              displayColumnHeaders: TRUE
            reportPath: '%FLOW_PATH_DATA%Reports/Ttree.ContentInsight'
            reportPrefix: 'content-inventory-report'
            properties:
              'id':
                label: 'ID'
                position: '<position-string>'
              'pageTitle':
                label: 'Page Title'
                position: '<position-string>'
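
With concrete values, the placeholders above could be filled in as follows. This is a sketch only: it assumes the weight is optional and that <key> refers to another key in the same properties section. Only the keys relevant to positioning are shown.

```yaml
Ttree:
  ContentInsight:
    presets:
      'custom':
        reportConfigurations:
          'csv':
            properties:
              'id':
                label: 'ID'
                # Assumed reading of "start (<weight>)" with the weight omitted
                position: 'start'
              'pageTitle':
                label: 'Page Title'
                # Assumed reading of "after <key>": place this column after 'id'
                position: 'after id'
```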

For a single crawling preset you can register multiple reports if required. For each property you can register a post processor if you need to manipulate the property in the report; see BooleanPostProcessor for a basic example.

How to skip specific URIs?

You can define invalid URI patterns in your crawling presets:

Ttree:
  ContentInsight:
    presets:
      'custom':
        invalidUriPatterns:
          'javascript':
            pattern: '@^javascript\:void\(0\)$@'
          'mailto':
            pattern: '@^mailto\:.*@'
          'anchor':
            pattern: '@^#.*@'
            message: 'Link to anchor'

If a pattern has a message, all URLs matching the pattern are logged. By default the crawler skips those URLs silently.
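
As a further sketch, a pattern with a message could be used to catch and log links to PDF files. The key pdf, the pattern, and the message text are hypothetical and are not part of the package defaults.

```yaml
Ttree:
  ContentInsight:
    presets:
      'custom':
        invalidUriPatterns:
          'pdf':
            # Hypothetical entry: matches URIs ending in .pdf
            pattern: '@\.pdf$@'
            # Because a message is set, matching URLs are logged
            message: 'Link to PDF file'
```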

Usage

To get the complete website inventory:

# flow contentinventor:extract --base-url http://www.domain.com

Or, to limit the crawler to a part of the website:

# flow contentinventor:extract --base-url http://www.domain.com/products

You can select a crawling preset:

# flow contentinventor:extract --base-url http://www.domain.com/products --preset default

ttree/contentinsight is a PHP Composer package. At the time of this listing it had 289 downloads and 2 GitHub stars, and was last updated on 13 November 2014.
