dkm/typo3-robots-guard
Composer install command:
composer require dkm/typo3-robots-guard
Centrally manages robots.txt, the robots head meta tag, and XML sitemap inclusion for any TYPO3 site — single-site or multisite — with environment-aware safety logic whose single purpose is to guard production against accidental crawl-blocking while keeping staging/dev fully disallowed.
- Composer package: dkm/typo3-robots-guard
- Extension key: robots_guard
- TYPO3: v13 LTS and v14 LTS (single codebase)
- PHP: 8.2 – 8.5
Why "guard"?
Two failure modes are not equal:
- Accidentally blocking production = an SEO disaster that recovers only slowly.
- Accidentally allowing staging = minor and quickly recoverable.
So the logic is deliberately asymmetric and fails in the safe direction:
| Context | robots.txt | robots meta |
|---|---|---|
| Live: Production or Production/Live (incl. unset, since TYPO3 defaults to bare Production) | the allowlist (below) | respects per-page SEO |
| Everything else (Staging, Development, and Production/Upgrade, Production/Testing, …) | hard Disallow: / for all agents | forced noindex, nofollow |
"Live" is deliberately stricter than TYPO3's isProduction(): only the bare Production
context and Production/Live are live. Other Production/* sub-contexts run on the
production server but are not the public site (upgrade/test environments), so they are
treated as non-live — disallowed and noindexed — which is the safe direction.
An unset TYPO3_CONTEXT landing on bare Production is the safe fail-open direction for
crawling; it is caught by monitoring (robots:verify, the backend badge, Search Console),
not by blocking. Do not invert this.
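The asymmetric rule above can be sketched in a few lines. This is an illustrative Python sketch, not the extension's PHP code; the names `is_live` and `robots_policy` are hypothetical, and the only facts taken from this README are the two live contexts, the fallback to bare Production when the variable is unset, and the two policies in the table.

```python
from typing import Optional

# Only these two contexts count as live (stricter than TYPO3's isProduction()).
LIVE_CONTEXTS = {"Production", "Production/Live"}


def is_live(typo3_context: Optional[str]) -> bool:
    """Return True only for bare Production and Production/Live."""
    # An unset or empty TYPO3_CONTEXT resolves to bare "Production" in TYPO3.
    context = typo3_context or "Production"
    return context in LIVE_CONTEXTS


def robots_policy(typo3_context: Optional[str]) -> dict:
    """Map a context to the robots.txt and robots meta behaviour from the table."""
    if is_live(typo3_context):
        return {"robots_txt": "allowlist", "robots_meta": "per-page SEO"}
    # Everything else, including Production/* sub-contexts, is locked down.
    return {"robots_txt": "Disallow: /", "robots_meta": "noindex, nofollow"}
```

Note that `Production/Upgrade` and `Production/Testing` land in the locked-down branch even though TYPO3 itself would report them as production.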
Zero-config
Installing the extension activates everything with no further setup:
- robots.txt is served per site (PSR-15 middleware),
- the robots meta / X-Robots-Tag policy is applied,
- environment branching is live,
- the shipped default robots.txt (below) is the active fallback immediately.
Per-site and installation overrides are optional refinements layered on top — the extension behaves correctly with none of them configured.
One exception — sitemap injection is not zero-config. See Sitemap inclusion below.
Composer-mode install needs a cache rebuild before it takes effect. See Deploying / installing below — "zero-config" means no configuration, not zero deploy steps.
Deploying / installing
The middleware that serves robots.txt is registered via
Configuration/RequestMiddlewares.php, and the whole middleware stack (plus the DI
container) is compiled into var/cache/code/*. In composer / Production mode TYPO3
does not auto-rebuild that compiled cache when packages change, so immediately after
composer require the new middleware is not in the stack yet, and /robots.txt falls
through to an empty/204-style response until the cache is rebuilt. (An empty robots.txt
means "all crawlers allowed everywhere", which is the fail-open direction, so the window is
over-permissive rather than blocking, but it must still be closed.)
So installing/updating this extension on any site is a deploy step, not a drop-in:
composer require dkm/typo3-robots-guard # first install (composer update on later deploys)
vendor/bin/typo3 cache:flush
# If /robots.txt is still empty, the compiled container/middleware is stale —
# cache:flush does not always rebuild var/cache/code/* in composer mode. Force it:
rm -rf var/cache/code/* && vendor/bin/typo3 cache:warmup
# Confirm it is actually serving (the deploy gate — would have caught the empty window):
vendor/bin/typo3 robots:verify --skip-sitemap
Wiring robots:verify into the root composer.json (see the robots:verify section below)
makes this automatic on every deploy: flush, then verify. A stale-cache empty robots.txt
then fails the deploy loudly instead of going live unnoticed.
Environment context must be consistent across CLI and web
This extension keys every decision off Environment::getContext() (the TYPO3_CONTEXT
env var). CLI and web can resolve different contexts, and when they do, the backend
badge, the served robots.txt, and the robots:verify gate disagree about whether the site
is live — the gate then validates a different reality than the public site serves.
The usual cause is a .env loader (helhum/dotenv-connector, symfony/dotenv,
dkm/dotenv-autoload, …) being non-overwriting by default: a TYPO3_CONTEXT already
injected by php-fpm/nginx (fastcgi_param / pool env[…]) wins for web requests,
while .env wins for CLI (which has no such injected var). Result: web might read
Production/Live while CLI reads Production/Staging.
Make TYPO3_CONTEXT come from a single source on each box. The robust choice is to let
.env/.env.local own it, because a .env is read identically by both the web SAPI and
the typo3 CLI binary — so CLI and web agree automatically, with no per-command env var and no
silent default. (The opposite — server-injected context only — does not reach CLI: php-fpm/nginx
env never propagates to the typo3 binary, so an unset CLI falls back to bare Production, which
this extension treats as live. That is the silent "staging asserted as live" trap.)
To load .env in both SAPIs, use a loader wired into vendor/autoload.php, e.g.:
- dkm/dotenv-autoload: loads .env then .env.local via an autoload.files hook (non-overwriting by default; DOTENV_OVERRIDE=1 to let .env.local win).
- helhum/dotenv-connector: Composer-plugin loader for the same purpose.
Whichever you pick, set TYPO3_CONTEXT there and remove any server-side injection so the two
never disagree. (Both loaders are non-overwriting, so if you keep a server var it still wins for
web and re-introduces the split.)
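To make the CLI/web split concrete, here is a small Python simulation of a non-overwriting loader. It does not reproduce the API of any of the loaders named above; `load_dotenv_non_overwriting` is a hypothetical helper that only models the "existing variables win" rule described in this section.

```python
def load_dotenv_non_overwriting(process_env: dict, dotenv_values: dict) -> dict:
    """Merge .env values into an environment without replacing existing keys."""
    merged = dict(process_env)
    for key, value in dotenv_values.items():
        # Non-overwriting: a variable already present in the process wins.
        merged.setdefault(key, value)
    return merged


dotenv_file = {"TYPO3_CONTEXT": "Production/Staging"}

# Web request: php-fpm/nginx already injected a context, so .env is ignored.
web_env = load_dotenv_non_overwriting({"TYPO3_CONTEXT": "Production/Live"}, dotenv_file)

# CLI: nothing is injected, so .env supplies the context.
cli_env = load_dotenv_non_overwriting({}, dotenv_file)

# Single source: with the server-side injection removed, both sides agree.
web_env_fixed = load_dotenv_non_overwriting({}, dotenv_file)
```

In the first two calls the same box resolves two different contexts, which is exactly the disagreement the badge and robots:verify are meant to surface. Removing the injected variable, as in the third call, closes the gap.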
robots:verify emits a context advisory on every run to surface this: it warns when an
env var is overriding .env in the CLI process, and otherwise reminds you that it sees only
the CLI context — the backend badge is authoritative for what web visitors actually get,
so confirm the two agree.
The shipped default robots.txt — a deliberate allowlist
The default is default-deny / allowlist: User-agent: * is Disallow: /, and only
named bots are allowed. This is intentional (a deliberate GEO trade-off): unlisted bots
get no production access. New crawlers must be added explicitly rather than slipping in.
- Allowed (search & AI-search, which drive traffic): Googlebot, bingbot, facebookexternalhit, Claude-SearchBot, Claude-User, OAI-SearchBot, ChatGPT-User, PerplexityBot.
- Blocked (AI-training, as a content opt-out): ClaudeBot, GPTBot, Google-Extended.
- Allowed bots still have admin/processing paths disallowed (/typo3/, /vendor/, /fileadmin/_processed_/, *tx_solr, *?q=, …).
The template lives at Resources/Private/Templates/robots.txt.
Override hierarchy
Three tiers, highest priority wins. Resolution is explicit and logged — never silent
fallthrough (a configured-but-unreadable source logs a warning and falls through; a total
miss is reported as SOURCE_UNKNOWN, which robots:verify treats as a failure).
1. Per-site override (highest) — in the site's config.yaml:
    robotsGuard:
      # inline content (wins over a path):
      robotsTxt: |
        User-agent: *
        Disallow: /
      # …or a file reference:
      robotsTxtPath: 'EXT:my_sitepackage/Configuration/robots.txt'
2. Installation override — extension configuration installationRobotsTxtPath
(Admin Tools → Settings → Extension Configuration → robots_guard): an absolute path or
EXT: reference to an installation-wide robots.txt.
3. Shipped default (lowest) — the template above.
The override hierarchy supplies the production allowlist content. In non-production (every context that is not live, as defined under Why "guard"?) the environment branching overrides all three tiers with the hard
Disallow: /.
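The three-tier resolution can be sketched as follows. This is a minimal Python illustration, not the extension's resolver. Only SOURCE_UNKNOWN comes from this README; the other source labels, the `Resolution` class, and the `read_file` callback (assumed to return None for an unreadable file) are hypothetical.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("robots_guard_sketch")


@dataclass
class Resolution:
    source: str             # which tier supplied the content
    content: Optional[str]  # None only when nothing resolved


def resolve_robots_txt(
    site_inline: Optional[str],
    site_path: Optional[str],
    installation_path: Optional[str],
    read_file: Callable[[str], Optional[str]],
    shipped_default: Optional[str],
) -> Resolution:
    # Tier 1a: inline per-site content wins over everything, including a path.
    if site_inline:
        return Resolution("SITE_INLINE", site_inline)
    # Tier 1b, then tier 2: file references, tried in priority order.
    for source, path in (("SITE_PATH", site_path), ("INSTALLATION_PATH", installation_path)):
        if not path:
            continue  # not configured, nothing to warn about
        content = read_file(path)
        if content is not None:
            return Resolution(source, content)
        # Configured but unreadable: warn, then fall through to the next tier.
        logger.warning("robots.txt source %s (%s) is unreadable", source, path)
    # Tier 3: the shipped default template.
    if shipped_default is not None:
        return Resolution("SHIPPED_DEFAULT", shipped_default)
    # Total miss: reported explicitly rather than serving something silently.
    return Resolution("SOURCE_UNKNOWN", None)
```

Returning the source label together with the content is what lets a verifier treat a total miss as a failure instead of guessing from an empty response.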
Robots meta tag
In non-production, indexing is suppressed two ways (defence in depth):
- X-Robots-Tag: noindex, nofollow response header on every frontend response: the cache-safe guarantee (applies on cache hits too).
- <meta name="robots" content="noindex,nofollow"> via TYPO3's MetaTagManager on cache-miss page generation: visible parity in page source.
In production both are no-ops, so per-page TYPO3 SEO settings are fully respected.
Sitemap inclusion
The Sitemap: line in robots.txt is owned by the dependency
eliashaeussler/typo3-sitemap-robots
(EXT:seo provides the actual XML sitemap). We never reimplement injection.
⚠️ This is not zero-config. The dependency only injects the Sitemap: line when the site's config.yaml enables it. Add per production site:
    sitemap_robots_inject: all # all languages (use 'default' for default language only)
Valid values: all, default. Omitted/empty disables injection. robots:verify emits a WARN for any production site missing this key.
robots:verify — the deploy safety gate
An in-stack check that asserts against the exact content the middleware serves (it shares
the resolver — no second source of truth). It parses robots.txt into per-user-agent
groups (via spatie/robots-txt) and asserts per agent — naive Disallow: / substring
checks are useless against a default-deny file. Returns a non-zero exit code on any
failure, so it works as a deploy gate or cron alarm.
vendor/bin/typo3 robots:verify # full check (makes outbound HTTP for sitemaps)
vendor/bin/typo3 robots:verify --skip-sitemap # fast gate, no outbound HTTP
vendor/bin/typo3 robots:verify --json # machine-readable
Production asserts: must-allow agents resolve to allowed-at-root; AI-training agents to
Disallow: /; User-agent: * stays default-deny; the meta policy is not forced-noindex;
resolution is not SOURCE_UNKNOWN; each site has ≥1 valid sitemap (unless --skip-sitemap).
Non-production asserts the inverse: every must-allow agent is disallowed-at-root and the
meta policy forces noindex.
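Why per-agent parsing matters can be shown with a short example. The extension uses spatie/robots-txt for this; the sketch below substitutes Python's standard-library `urllib.robotparser` as a stand-in. The sample file is a shortened, hypothetical allowlist, not the shipped template, and `allowed_at_root` is an illustrative helper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical, shortened default-deny file in the spirit of the shipped allowlist.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /typo3/
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /
"""


def allowed_at_root(robots_txt: str, agent: str) -> bool:
    """Parse the file into per-agent groups and ask whether `agent` may fetch /."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "/")


# A naive substring check flags this file as blocking, although Googlebot is allowed.
naive_blocked = "Disallow: /" in ROBOTS_TXT

googlebot_ok = allowed_at_root(ROBOTS_TXT, "Googlebot")
gptbot_ok = allowed_at_root(ROBOTS_TXT, "GPTBot")
unlisted_ok = allowed_at_root(ROBOTS_TXT, "SomeUnlistedBot")
```

Here `naive_blocked` is true for a perfectly healthy production file, while the per-agent answers distinguish the must-allow agent, the AI-training agent, and the default-deny fallback.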
Run it in your deploy runbook (this project has no CI). --skip-sitemap is recommended
for a fast pre-deploy gate; run the full check periodically.
Wire it into the root composer.json so every install/update flushes the cache (so the
freshly-installed middleware is actually active — see Deploying)
and then verifies:
    "scripts": {
        "robots-guard-verify": [
            "typo3 cache:flush",
            "typo3 robots:verify --skip-sitemap"
        ],
        "post-install-cmd": ["@robots-guard-verify"],
        "post-update-cmd": ["@robots-guard-verify"]
    }
(Composer runs scripts only from the root package, so this is inherently per-project —
add it to each site's root composer.json, not the extension's.)
Backend environment badge
For admin backend users, the TYPO3 backend shows a toolbar badge indicating the
environment — a calm green ✓ Live! when live, and a loud, gently pulsing red
⚠ DEVELOPMENT / ⚠ PRODUCTION/STAGING … otherwise (the pulse respects
prefers-reduced-motion). (Non-admins see nothing.) It is a native toolbar item, so it costs no extra layout height.
Clicking it opens a dropdown with a live readout of the actual robots decision (in
non-production: robots.txt serving Disallow: / and pages forced to noindex, nofollow;
in production: the allowlist is served and per-page SEO is respected).
Configuration-mismatch state (highest priority). Because the backend runs in the web SAPI,
it can compare the resolved TYPO3_CONTEXT against what .env/.env.local declares — the
exact CLI/web split described above.
If they disagree (a server-injected context is overriding .env for web), the badge overrides
even the green "Live!" with a red, harder-glowing ⚠ Configuration mismatch, and its dropdown
spells out both contexts and the fix. This is the web-side complement to robots:verify's CLI
advisory: between them, the split is caught from whichever side it's visible.
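The badge's priority order can be sketched as a single function. This is an illustrative Python sketch, not the extension's code: `badge_label` is a hypothetical name, and treating a .env that declares no TYPO3_CONTEXT as "no mismatch" is an assumption the README does not confirm.

```python
from typing import Optional

LIVE_CONTEXTS = {"Production", "Production/Live"}


def badge_label(web_context: str, dotenv_context: Optional[str]) -> str:
    """Pick the badge text; the mismatch state outranks both other states."""
    # Highest priority: the web SAPI resolved something other than .env declares.
    # Assumption: a .env without TYPO3_CONTEXT (None) is not treated as a mismatch.
    if dotenv_context is not None and dotenv_context != web_context:
        return "⚠ Configuration mismatch"
    if web_context in LIVE_CONTEXTS:
        return "✓ Live!"
    # Non-live contexts are shown loudly, in upper case.
    return "⚠ " + web_context.upper()
```

Checking the mismatch first is the point of the design: a web context of Production/Live with a .env that says Production/Staging must not show the calm green state.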
Monitoring — what is and isn't in scope
- In scope (built): robots:verify, the backend badge, the functional/unit tests.
- Out of scope (operational): Google Search Console is the authoritative external monitor. It is the only thing that catches an edge/infra gap, such as a CDN or reverse proxy serving a different robots.txt than TYPO3 generates. Add each production property to GSC and watch the robots.txt report and Coverage. This extension deliberately does not build an HTTP-fetch-and-diff prober.
Testing
cd packages/robots_guard
ddev start
ddev composer install
ddev composer test # cgl + unit + functional
# or individually:
ddev composer test:unit
ddev composer test:functional
v13 / v14 matrix: bump php_version in .ddev/config.yaml and re-run
ddev composer update constrained to the target core (^13.4 or ^14.0) — the
dependency graph resolves on both (verified: core 13.4.x + testing-framework 8.x;
core 14.3.x + testing-framework 9.x).
Manual verification checklist (not auto-tested)
The HTTP/render layer is best confirmed live on DDEV:
- [ ] curl -s https://<prod-site>/robots.txt → the allowlist; Googlebot allowed.
- [ ] In a Staging/Development context, /robots.txt → only User-agent: * / Disallow: /.
- [ ] Non-production page response carries X-Robots-Tag: noindex, nofollow (check headers).
- [ ] Non-production page source contains <meta name="robots" content="noindex,nofollow">.
- [ ] With sitemap_robots_inject: all set, the Sitemap: line appears in robots.txt.
- [ ] Backend toolbar badge: red ⚠ in non-production, green ✓ Live! in production; dropdown shows the robots decision.
- [ ] [VERIFY] Confirm sitemap-locator:locate --validate exit code on a missing sitemap (docs suggest it may exit 0); our gate checks sitemaps in-process and returns failure itself, so this only affects whether you could also shell out to it.
License
GPL-2.0-or-later.
Other information
- License: GPL-2.0-or-later
- Last updated: 2026-06-23