
Moderation: AI-first, human-final

Spam, policy, and tone classifiers run in under 500 ms: auto-allow above 0.85, auto-flag below 0.6, human queue in between. Target: 92% automatic decisions, 8% human, zero false-allows.

Moderation architecture is one of the hardest parts of a B2B community platform. On one hand, we can't manually review every post (it doesn't scale). On the other, we can't let an LLM decide alone on a contractual customer channel. The compromise we landed on: AI-first, human-final.

The pre-publish moderation pipeline

Before a post or comment goes public, three classifiers run on it:

  1. Spam classifier (under 100 ms). A lightweight fastText model targeting obvious spam patterns: links to unknown domains, repetitive content, and text copy-pasted across threads.

  2. Policy-violation classifier (under 200 ms). A heavier fine-tuned transformer trained on the community policy: abusive language, competitor-vendor commentary, and customer-data leaks (pattern matching for phone numbers, email addresses, and IBANs).

  3. Tone check (about 200 ms). A zero-shot classifier that inspects the post's tone: extreme emotion, or an offended or passive-aggressive register. It doesn't always block; often it just nudges the poster ("are you sure you want to publish this without softening the tone?").

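All three run in parallel once the poster presses Publish, so the wait is set by the slowest classifier (about 200 ms) rather than by the sum of all three. End-to-end latency stays under 500 ms, so there's no perceptible UI lag.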
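A minimal sketch of that parallel fan-out, assuming the classifiers are exposed as async calls. The function names, the returned scores, and the shared 500 ms deadline handling are all hypothetical stand-ins, not our production code; the sleeps only mimic latency (scaled down so the demo runs quickly).

```python
import asyncio
import time

# Hypothetical stand-ins for the three classifiers. Each returns a 0-1 score.
async def spam_score(text: str) -> float:
    await asyncio.sleep(0.010)   # pretend model latency
    return 0.97

async def policy_score(text: str) -> float:
    await asyncio.sleep(0.020)
    return 0.91

async def tone_score(text: str) -> float:
    await asyncio.sleep(0.020)
    return 0.88

async def classify(text: str, budget_s: float = 0.5) -> dict[str, float]:
    """Run all three classifiers concurrently under one shared deadline."""
    spam, policy, tone = await asyncio.wait_for(
        asyncio.gather(spam_score(text), policy_score(text), tone_score(text)),
        timeout=budget_s,  # raises a timeout error if the budget is exceeded
    )
    return {"spam": spam, "policy": policy, "tone": tone}

start = time.perf_counter()
scores = asyncio.run(classify("Has anyone tried the new export API?"))
elapsed = time.perf_counter() - start
print(scores, f"{elapsed * 1000:.0f} ms")
```

What should happen when the deadline is missed is a separate design decision that this sketch leaves open; one conservative option would be to treat a timeout like a low-confidence result and send the post to the human queue.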

The confidence thresholds

Classifier output is a confidence score between 0 and 1, where a higher score means higher confidence that the post is acceptable. The decision rules:

  • Above 0.85: auto-allow. The post publishes immediately and is logged by category for later human audit.
  • Below 0.6: auto-flag. The post is hidden (only the poster sees it, with an "under review" badge) and is forced into the human queue.
  • 0.6-0.85: human queue, but the post is provisionally published in the meantime. If a human moderator rejects it within 4 hours, the post is removed; if it is approved, nothing further happens.

The last rule ("publish-while-reviewing") is deliberate: posts in the 0.6-0.85 band rarely turn out to be actual violations, and 4 hours of provisional publication does less harm than a backlog that holds everything for 4 hours.
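The three bands can be sketched as a small routing function. This assumes the pipeline has already reduced the classifier outputs to a single score; how that combination happens is not specified here, and the names `Decision` and `route` are illustrative only.

```python
from enum import Enum

AUTO_ALLOW_ABOVE = 0.85
AUTO_FLAG_BELOW = 0.60

class Decision(Enum):
    AUTO_ALLOW = "auto_allow"            # published, logged for later audit
    PROVISIONAL = "provisional_publish"  # published, waiting in the human queue
    AUTO_FLAG = "auto_flag"              # hidden, forced into the human queue

def route(score: float,
          allow_above: float = AUTO_ALLOW_ABOVE,
          flag_below: float = AUTO_FLAG_BELOW) -> Decision:
    """Map a 0-1 confidence score onto the three decision bands."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score > allow_above:
        return Decision.AUTO_ALLOW
    if score < flag_below:
        return Decision.AUTO_FLAG
    # Both boundaries (exactly 0.6 and exactly 0.85) fall in the middle band.
    return Decision.PROVISIONAL

for s in (0.92, 0.85, 0.70, 0.60, 0.41):
    print(f"{s:.2f} -> {route(s).value}")
```

Keeping the thresholds as parameters rather than hard-coded literals makes it cheap to tighten them later without touching the routing logic.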

The first-month target

In the first month after launch we're aiming for:

  • 92% auto-decisions. 92% of posts are decided automatically at publish time (both sides: auto-allow above 0.85 plus auto-flag below 0.6).
  • 8% human queue. The remaining 8% sits in the 0.6-0.85 band where the human moderator decides.
  • 0 false-allows for policy violations. No policy-violating post should slip through auto-allow. This is the most important metric; a single incident that slips through damages an 80-person design-partner relationship more than the time saved by auto-decisions buys back.

If even one false-allow occurs, we immediately raise the auto-allow threshold from 0.85 to 0.9 and route everything in the 0.85-0.9 band to the human queue until the next retraining cycle.
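As a rough illustration of how these targets and the tightening rule could be checked, here is a sketch with invented numbers. The `MonthlyStats` structure and both helper functions are hypothetical, and the post counts are made up purely so the arithmetic lands on the 92% target.

```python
from dataclasses import dataclass

@dataclass
class MonthlyStats:
    auto_allowed: int
    auto_flagged: int
    human_queued: int   # posts that landed in the 0.6-0.85 band
    false_allows: int   # auto-allowed posts later found to violate policy

def auto_rate(stats: MonthlyStats) -> float:
    """Share of posts decided automatically (auto-allow plus auto-flag)."""
    total = stats.auto_allowed + stats.auto_flagged + stats.human_queued
    if total == 0:
        return 0.0
    return (stats.auto_allowed + stats.auto_flagged) / total

def next_allow_threshold(stats: MonthlyStats,
                         current: float = 0.85,
                         tightened: float = 0.90) -> float:
    """A single false-allow is enough to tighten the auto-allow threshold."""
    return tightened if stats.false_allows >= 1 else current

# Invented example: 1,000 posts in the month.
clean = MonthlyStats(auto_allowed=850, auto_flagged=70, human_queued=80, false_allows=0)
incident = MonthlyStats(auto_allowed=850, auto_flagged=70, human_queued=80, false_allows=1)

print(f"auto rate: {auto_rate(clean):.0%}")
print("threshold, clean month:", next_allow_threshold(clean))
print("threshold, after one incident:", next_allow_threshold(incident))
```

With the tightened threshold in place, posts scoring between 0.85 and 0.9 would fall into the human-queue band, so the auto rate should be expected to drop until the next retraining cycle.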

The one thing we won't trust the AI on

Legal or regulatory references always go to human review. If a post or KB article cites a law (NAV rule, GDPR article, ESPD/EUSPI tender procedure), it auto-routes to human review regardless of classifier confidence.

Two reasons. (a) Legal text accuracy is critical: an LLM hallucination (an invented paragraph number, or a quote from a real but repealed rule) reaches the reader looking like a legitimate reference. (b) The legal disclaimers in the tenant contract make vendor liability harder to bound when wrong legal advice goes out from our platform, regardless of whether the poster was a peer user.

A simple keyword detector scans the post body for patterns such as NAV §, GDPR Article, Tvr., Korm. r., and Tt.; any hit sends the post straight to the human queue, whatever the other classifiers scored.
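A detector along these lines could be as small as one compiled regular expression. The sketch below covers only the patterns named above; the exact expressions, the case handling, and the function name `needs_legal_review` are assumptions for illustration, and a real list would be longer.

```python
import re

# Illustrative patterns only; the abbreviations are matched case-sensitively
# to limit false positives on ordinary words.
LEGAL_PATTERNS = [
    r"\bNAV\s*§",               # e.g. "NAV § 12"
    r"\bGDPR\s+[Aa]rticle\b",   # e.g. "GDPR Article 17"
    r"\bTvr\.",
    r"\bKorm\.\s*r\.",
    r"\bTt\.",
]
LEGAL_RE = re.compile("|".join(LEGAL_PATTERNS))

def needs_legal_review(body: str) -> bool:
    """True if the post body contains any legal-reference pattern."""
    return LEGAL_RE.search(body) is not None

samples = [
    "Under GDPR Article 17 the customer can ask for erasure.",
    "Per NAV § 12 this should be deductible.",
    "See the relevant Korm. r. for the deadline.",
    "Has anyone tried the new export API?",
]
for text in samples:
    print(needs_legal_review(text), "|", text)
```

In this sketch the check would run before any score-based routing, so a hit overrides the classifier outcome instead of competing with it.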