dkm/typo3-robots-guard

Composer install command:

composer require dkm/typo3-robots-guard

Package description

Centrally manages robots.txt, the robots head meta tag and XML sitemap inclusion for any TYPO3 site (single-site or multisite), with environment-aware safety logic that guards production against accidental crawl-blocking while keeping staging/dev fully disallowed.

README

Centrally manages robots.txt, the robots head meta tag, and XML sitemap inclusion for any TYPO3 site — single-site or multisite — with environment-aware safety logic whose single purpose is to guard production against accidental crawl-blocking while keeping staging/dev fully disallowed.

  • Composer package: dkm/typo3-robots-guard
  • Extension key: robots_guard
  • TYPO3: v13 LTS and v14 LTS (single codebase)
  • PHP: 8.2 – 8.5

Why "guard"?

Two failure modes are not equal:

  • Accidentally blocking production = an SEO disaster that recovers only slowly.
  • Accidentally allowing staging = minor and quickly recoverable.

So the logic is deliberately asymmetric and fails in the safe direction:

| Context | robots.txt | robots meta |
| --- | --- | --- |
| Live: Production or Production/Live (incl. unset; TYPO3 defaults to bare Production) | the allowlist (below) | respects per-page SEO |
| Everything else (Staging, Development, and Production/Upgrade, Production/Testing, …) | hard Disallow: / for all agents | forced noindex, nofollow |

"Live" is deliberately stricter than TYPO3's isProduction(): only the bare Production context and Production/Live are live. Other Production/* sub-contexts run on the production server but are not the public site (upgrade/test environments), so they are treated as non-live — disallowed and noindexed — which is the safe direction.

An unset TYPO3_CONTEXT landing on bare Production is the safe fail-open direction for crawling; it is caught by monitoring (robots:verify, the backend badge, Search Console), not by blocking. Do not invert this.
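
The live/non-live split described above can be sketched as a small shell helper, for example for use in a deploy script. This is a minimal illustration of the rule as the text states it, not the extension's own code; the function name robots_guard_classify is hypothetical.

```shell
# Hypothetical helper (not part of the extension): classify a TYPO3_CONTEXT
# value following the rule described in this README.
#   - bare "Production" and "Production/Live" are live
#   - an unset or empty value is treated as bare Production, hence live
#   - everything else (Staging, Development, Production/Upgrade, ...) is non-live
robots_guard_classify() {
    context="${1:-Production}"
    case "$context" in
        Production|Production/Live) echo "live" ;;
        *) echo "non-live" ;;
    esac
}

# Usage:
#   robots_guard_classify "$TYPO3_CONTEXT"
#   robots_guard_classify "Production/Upgrade"   # prints: non-live
```

Note that the match is exact: any other Production/* sub-context falls into the non-live branch, which mirrors the asymmetry the table describes.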

Zero-config

Installing the extension activates everything with no further setup:

  • robots.txt is served per site (PSR-15 middleware),
  • the robots meta / X-Robots-Tag policy is applied,
  • environment branching is live,
  • the shipped default robots.txt (below) is the active fallback immediately.

Per-site and installation overrides are optional refinements layered on top — the extension behaves correctly with none of them configured.

One exception — sitemap injection is not zero-config. See Sitemap inclusion below.

Composer-mode install needs a cache rebuild before it takes effect. See Deploying / installing below — "zero-config" means no configuration, not zero deploy steps.

Deploying / installing

The middleware that serves robots.txt is registered via Configuration/RequestMiddlewares.php, and the whole middleware stack (plus the DI container) is compiled into var/cache/code/*. In composer / Production mode TYPO3 does not auto-rebuild that compiled cache when packages change, so immediately after composer require the new middleware is not in the stack yet, and /robots.txt falls through to an empty/204-style response until the cache is rebuilt. (An empty robots.txt means "all crawlers allowed everywhere", the fail-open direction, so the window is over-permissive rather than blocking, but it must still be closed.)

So installing/updating this extension on any site is a deploy step, not a drop-in:

composer require dkm/typo3-robots-guard      # first install (composer update on later deploys)
vendor/bin/typo3 cache:flush
# If /robots.txt is still empty, the compiled container/middleware is stale —
# cache:flush does not always rebuild var/cache/code/* in composer mode. Force it:
rm -rf var/cache/code/* && vendor/bin/typo3 cache:warmup
# Confirm it is actually serving (the deploy gate — would have caught the empty window):
vendor/bin/typo3 robots:verify --skip-sitemap

Wiring robots:verify into the root composer.json (see robots:verify) makes this automatic on every deploy — flush, then verify — so a stale-cache empty robots.txt fails the deploy loudly instead of going live unnoticed.

Environment context must be consistent across CLI and web

This extension keys every decision off Environment::getContext() (the TYPO3_CONTEXT env var). CLI and web can resolve different contexts, and when they do, the backend badge, the served robots.txt, and the robots:verify gate disagree about whether the site is live — the gate then validates a different reality than the public site serves.

The usual cause is a .env loader (helhum/dotenv-connector, symfony/dotenv, dkm/dotenv-autoload, …) being non-overwriting by default: a TYPO3_CONTEXT already injected by php-fpm/nginx (fastcgi_param / pool env[…]) wins for web requests, while .env wins for CLI (which has no such injected var). Result: web might read Production/Live while CLI reads Production/Staging.

Make TYPO3_CONTEXT come from a single source on each box. The robust choice is to let .env/.env.local own it, because a .env is read identically by both the web SAPI and the typo3 CLI binary — so CLI and web agree automatically, with no per-command env var and no silent default. (The opposite — server-injected context only — does not reach CLI: php-fpm/nginx env never propagates to the typo3 binary, so an unset CLI falls back to bare Production, which this extension treats as live. That is the silent "staging asserted as live" trap.)

To load .env in both SAPIs, use a loader wired into vendor/autoload.php, e.g.:

  • dkm/dotenv-autoload — loads .env then .env.local via an autoload.files hook (non-overwriting by default; DOTENV_OVERRIDE=1 to let .env.local win).
  • helhum/dotenv-connector — Composer-plugin loader for the same purpose.

Whichever you pick, set TYPO3_CONTEXT there and remove any server-side injection so the two never disagree. (Both loaders are non-overwriting, so if you keep a server var it still wins for web and re-introduces the split.)
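
To spot the split by hand, one option is to compare what a .env file declares with what the current process resolves. The sketch below is a rough illustration under stated assumptions: the helper names dotenv_context and context_agrees are hypothetical, the file is assumed to contain plain KEY=value lines (optionally quoted, no export prefix, no inline comments), and it only sees the process it runs in, so a check from the CLI says nothing about what php-fpm resolves.

```shell
# Hypothetical helper: print the TYPO3_CONTEXT value declared in a dotenv file.
# Assumes simple KEY=value lines; this is not a full dotenv parser.
# If the key appears more than once, the last occurrence is used.
dotenv_context() {
    sed -n 's/^[[:space:]]*TYPO3_CONTEXT=//p' "$1" \
        | tail -n 1 \
        | sed 's/^["'\'']//; s/["'\'']$//'
}

# Hypothetical helper: compare the declared value with the value resolved in
# this process. An unset value on either side is treated as bare Production.
# Returns 0 when both agree, 1 (with a message on stderr) when they differ.
context_agrees() {
    declared="$(dotenv_context "$1")"
    resolved="${TYPO3_CONTEXT:-Production}"
    if [ "${declared:-Production}" = "$resolved" ]; then
        return 0
    fi
    echo "TYPO3_CONTEXT mismatch: file declares '${declared:-Production}', process resolves '$resolved'" >&2
    return 1
}

# Usage:
#   context_agrees .env || echo "fix the context source before deploying"
```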

robots:verify emits a context advisory on every run to surface this: it warns when an env var is overriding .env in the CLI process, and otherwise reminds you that it sees only the CLI context — the backend badge is authoritative for what web visitors actually get, so confirm the two agree.

The shipped default robots.txt — a deliberate allowlist

The default is default-deny / allowlist: User-agent: * is Disallow: /, and only named bots are allowed. This is intentional, a deliberate GEO (generative engine optimization) trade-off: unlisted bots get no production access. New crawlers must be added explicitly rather than slipping in.

  • Allowed (search & AI-search, which drive traffic): Googlebot, bingbot, facebookexternalhit, Claude-SearchBot, Claude-User, OAI-SearchBot, ChatGPT-User, PerplexityBot.
  • Blocked (AI-training — content opt-out): ClaudeBot, GPTBot, Google-Extended.
  • Allowed bots still have admin/processing paths disallowed (/typo3/, /vendor/, /fileadmin/_processed_/, *tx_solr, *?q=, …).

The template lives at Resources/Private/Templates/robots.txt.

Override hierarchy

Three tiers, highest priority wins. Resolution is explicit and logged — never silent fallthrough (a configured-but-unreadable source logs a warning and falls through; a total miss is reported as SOURCE_UNKNOWN, which robots:verify treats as a failure).

1. Per-site override (highest) — in the site's config.yaml:

robotsGuard:
  # inline content (wins over a path):
  robotsTxt: |
    User-agent: *
    Disallow: /
  # …or a file reference:
  robotsTxtPath: 'EXT:my_sitepackage/Configuration/robots.txt'

2. Installation override — extension configuration installationRobotsTxtPath (Admin Tools → Settings → Extension Configuration → robots_guard): an absolute path or EXT: reference to an installation-wide robots.txt.

3. Shipped default (lowest) — the template above.

The override hierarchy supplies the production allowlist content. In non-production the environment branching overrides all three tiers with the hard Disallow: /.
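
The tier order can be pictured as a simple first-match search. The following sketch only illustrates the precedence described above. The function name, its arguments and the printed labels (other than SOURCE_UNKNOWN) are hypothetical, and it assumes any EXT: reference has already been resolved to a filesystem path.

```shell
# Hypothetical sketch of the resolution order (not the extension's resolver).
#   $1 = context class: "live" or "non-live"
#   $2 = per-site inline content (may be empty)
#   $3 = per-site file path (may be empty)
#   $4 = installation-wide file path (may be empty)
#   $5 = shipped default file path
# Prints a label naming the winning source; returns 1 on a total miss.
resolve_robots_source() {
    # Environment branching beats every tier outside live contexts.
    if [ "$1" != "live" ]; then
        echo "forced-disallow"
        return 0
    fi
    # Tier 1a: inline content wins over a path.
    if [ -n "$2" ]; then
        echo "site-inline"
        return 0
    fi
    # Tier 1b, tier 2, tier 3: first readable file wins.
    for candidate in "site-path:$3" "installation-path:$4" "shipped-default:$5"; do
        label="${candidate%%:*}"
        path="${candidate#*:}"
        if [ -z "$path" ]; then
            continue
        fi
        if [ -r "$path" ]; then
            echo "$label"
            return 0
        fi
        # Configured but unreadable: warn, then fall through to the next tier.
        echo "warning: $label is configured but unreadable: $path" >&2
    done
    echo "SOURCE_UNKNOWN"
    return 1
}
```

The warning on stderr corresponds to the "explicit and logged" behaviour: a broken override is never skipped silently.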

Robots meta tag

In non-production, indexing is suppressed two ways (defence in depth):

  • X-Robots-Tag: noindex, nofollow response header on every frontend response — the cache-safe guarantee (applies on cache hits too).
  • <meta name="robots" content="noindex,nofollow"> via TYPO3's MetaTagManager on cache-miss page generation — visible parity in page source.

In production both are no-ops, so per-page TYPO3 SEO settings are fully respected.
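
A quick way to confirm the header half of this on a non-production box is to inspect the response headers. The helper below is a small sketch, with has_noindex_header as a hypothetical name; it reads headers on stdin and checks only for the presence of both directives in an X-Robots-Tag line.

```shell
# Hypothetical helper: read HTTP response headers on stdin and succeed only if
# an X-Robots-Tag header carrying both noindex and nofollow is present.
# Header names are matched case-insensitively; CR line endings are stripped.
has_noindex_header() {
    tr -d '\r' \
        | grep -i '^x-robots-tag:' \
        | grep -i 'noindex' \
        | grep -qi 'nofollow'
}

# Usage (the URL is a placeholder):
#   curl -sI https://staging.example.com/ | has_noindex_header \
#       && echo "indexing suppressed"
```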

Sitemap inclusion

The Sitemap: line in robots.txt is owned by the dependency eliashaeussler/typo3-sitemap-robots (EXT:seo provides the actual XML sitemap). We never reimplement injection.

⚠️ This is not zero-config. The dependency only injects the Sitemap: line when the site's config.yaml enables it. Add per production site:

sitemap_robots_inject: all      # all languages  (use 'default' for default language only)

Valid values: all, default. Omitted/empty disables injection. robots:verify emits a WARN for any production site missing this key.
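
If you want to catch a missing key before robots:verify does, a plain text check over each site's config.yaml is one possibility. This is a rough sketch: has_sitemap_inject is a hypothetical helper, it assumes the key sits at the top level on its own line, and it is a pattern match rather than a YAML parser.

```shell
# Hypothetical pre-deploy check: succeed only if the given site config.yaml
# sets sitemap_robots_inject to one of the documented values (all, default).
# Accepts optional quotes around the value and a trailing comment.
has_sitemap_inject() {
    grep -Eq "^sitemap_robots_inject:[[:space:]]*['\"]?(all|default)['\"]?[[:space:]]*(#.*)?$" "$1"
}

# Usage (the config/sites/ layout is an assumption about your project):
#   for f in config/sites/*/config.yaml; do
#       has_sitemap_inject "$f" || echo "no sitemap injection configured: $f"
#   done
```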

robots:verify — the deploy safety gate

An in-stack check that asserts against the exact content the middleware serves (it shares the resolver — no second source of truth). It parses robots.txt into per-user-agent groups (via spatie/robots-txt) and asserts per agent — naive Disallow: / substring checks are useless against a default-deny file. Returns a non-zero exit code on any failure, so it works as a deploy gate or cron alarm.

vendor/bin/typo3 robots:verify              # full check (makes outbound HTTP for sitemaps)
vendor/bin/typo3 robots:verify --skip-sitemap   # fast gate, no outbound HTTP
vendor/bin/typo3 robots:verify --json            # machine-readable

Production asserts: must-allow agents resolve to allowed-at-root; AI-training agents to Disallow: /; User-agent: * stays default-deny; the meta policy is not forced-noindex; resolution is not SOURCE_UNKNOWN; each site has ≥1 valid sitemap (unless --skip-sitemap). Non-production asserts the inverse: every must-allow agent is disallowed-at-root and the meta policy forces noindex.
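
To see why per-agent evaluation matters, consider a simplified stand-in for the "allowed-at-root" assertion. The sketch below is not how robots:verify or spatie/robots-txt works internally. It is a deliberately reduced illustration that matches user agents exactly (case-insensitively), looks only at rules whose path is exactly /, and falls back to the * group when no named group exists. The function name agent_allowed_at_root is hypothetical.

```shell
# Hypothetical helper: succeed if agent $1 may fetch "/" according to the
# robots.txt file $2, under the simplifications named above.
agent_allowed_at_root() {
    awk -v agent="$1" '
        BEGIN { want = tolower(agent) }
        {
            sub("\r$", "")          # tolerate CRLF files
            sub("#.*", "")          # drop comments
            sep = index($0, ":")
            if (sep == 0) next
            field = tolower(substr($0, 1, sep - 1))
            value = substr($0, sep + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", field)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (field == "user-agent") {
                # Consecutive User-agent lines share one group.
                if (!in_agents) { cur_named = 0; cur_star = 0 }
                in_agents = 1
                if (tolower(value) == want) { cur_named = 1; named_found = 1 }
                if (value == "*") cur_star = 1
            } else if (field == "allow" || field == "disallow") {
                in_agents = 0
                if (value == "/") {
                    if (cur_named) named[field] = 1
                    if (cur_star) star[field] = 1
                }
            }
        }
        END {
            # A named group, if present, replaces the * group entirely.
            if (named_found) blocked = named["disallow"] && !named["allow"]
            else blocked = star["disallow"] && !star["allow"]
            exit (blocked ? 1 : 0)
        }
    ' "$2"
}
```

Against a default-deny file, a plain search for Disallow: / always matches because of the * group, yet a named bot whose own group only disallows /typo3/ is still allowed at the root. That is the distinction the per-agent assertions rely on.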

Run it in your deploy runbook (this project has no CI). --skip-sitemap is recommended for a fast pre-deploy gate; run the full check periodically.

Wire it into the root composer.json so every install/update flushes the cache (so the freshly-installed middleware is actually active — see Deploying) and then verifies:

"scripts": {
    "robots-guard-verify": [
        "typo3 cache:flush",
        "typo3 robots:verify --skip-sitemap"
    ],
    "post-install-cmd": ["@robots-guard-verify"],
    "post-update-cmd":  ["@robots-guard-verify"]
}

(Composer runs scripts only from the root package, so this is inherently per-project — add it to each site's root composer.json, not the extension's.)

Backend environment badge

For admin backend users, the TYPO3 backend shows a toolbar badge indicating the environment — a calm green ✓ Live! when live, and a loud, gently pulsing red ⚠ DEVELOPMENT / ⚠ PRODUCTION/STAGING … otherwise (the pulse respects prefers-reduced-motion). (Non-admins see nothing.) It is a native toolbar item, so it costs no extra layout height. Clicking it opens a dropdown with a live readout of the actual robots decision (in non-production: robots.txt serving Disallow: / and pages forced to noindex, nofollow; in production: the allowlist is served and per-page SEO is respected).

Configuration-mismatch state (highest priority). Because the backend runs in the web SAPI, it can compare the resolved TYPO3_CONTEXT against what .env/.env.local declares — the exact CLI/web split described above. If they disagree (a server-injected context is overriding .env for web), the badge overrides even the green "Live!" with a red, harder-glowing ⚠ Configuration mismatch, and its dropdown spells out both contexts and the fix. This is the web-side complement to robots:verify's CLI advisory: between them, the split is caught from whichever side it's visible.

Monitoring — what is and isn't in scope

  • In scope (built): robots:verify, the backend badge, the functional/unit tests.
  • Out of scope (operational): Google Search Console is the authoritative external monitor. It is the only thing that catches an edge/infra gap — a CDN or reverse proxy serving a different robots.txt than TYPO3 generates. Add each production property to GSC and watch the robots.txt report and Coverage. This extension deliberately does not build an HTTP-fetch-and-diff prober.

Testing

cd packages/robots_guard
ddev start
ddev composer install
ddev composer test            # cgl + unit + functional
# or individually:
ddev composer test:unit
ddev composer test:functional

v13 / v14 matrix: bump php_version in .ddev/config.yaml and re-run ddev composer update constrained to the target core (^13.4 or ^14.0) — the dependency graph resolves on both (verified: core 13.4.x + testing-framework 8.x; core 14.3.x + testing-framework 9.x).

Manual verification checklist (not auto-tested)

The HTTP/render layer is best confirmed live on DDEV:

  • [ ] curl -s https://<prod-site>/robots.txt → the allowlist; Googlebot allowed.
  • [ ] In a Staging/Development context, /robots.txt → only User-agent: * / Disallow: /.
  • [ ] Non-production page response carries X-Robots-Tag: noindex, nofollow (check headers).
  • [ ] Non-production page source contains <meta name="robots" content="noindex,nofollow">.
  • [ ] With sitemap_robots_inject: all set, the Sitemap: line appears in robots.txt.
  • [ ] Backend toolbar badge: red ⚠ in non-production, green ✓ Live! in production; dropdown shows the robots decision.
  • [ ] [VERIFY] Confirm sitemap-locator:locate --validate exit code on a missing sitemap (docs suggest it may exit 0); our gate checks sitemaps in-process and returns failure itself, so this only affects whether you could also shell out to it.

License

GPL-2.0-or-later.

Statistics

  • Total downloads: 0
  • Monthly downloads: 0
  • Daily downloads: 0
  • Favorites: 0
  • Clicks: 0
  • Dependents: 0
  • Suggesters: 0

GitHub information

  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Language: PHP

Other information

  • License: GPL-2.0-or-later
  • Last updated: 2026-06-23
