
Design Google Translate


Google Translate System Design


Introduction

Google Translate is a widely used language-translation service offered by Google. Powered by machine-learning (ML) models, it translates text between more than 130 languages and serves over a billion users as of 2024.

Google Translate Overview


Clarifying Requirements

  1. Real-Time translation vs Batch Translation (Model Architecture)
  2. Text vs Audio vs Visual (Multi-modal)
  3. Cloud vs On-Device (Model size, Inference Optimization)
  4. Bilingual vs Multilingual

To simplify the problem, we will limit the scope to a batch, multilingual, text-only, cloud-based translation system.


Frame the Problem as an ML Task

Specifying the System’s Input and Output

Choosing a Suitable ML Approach

Language translation is a classic sequence-to-sequence (seq2seq) task. Modern systems favor Transformer-based encoder-decoder models:

  1. Encoder – Converts the source sentence into contextual vectors.
  2. Decoder – Generates the target sentence token-by-token, attending to the encoder’s output.
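As a quick illustration of this encoder-decoder setup, the sketch below runs a public pretrained Marian NMT checkpoint through the Hugging Face transformers API. This is a minimal sketch for intuition, not Google's production stack; the model name and decoding settings are just illustrative.

```python
# Minimal sketch: seq2seq translation with a pretrained encoder-decoder model.
# Assumes the Hugging Face `transformers` package and the public
# Helsinki-NLP/opus-mt-en-fr checkpoint.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-fr"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer(["Hello, how are you?"], return_tensors="pt", padding=True)
generated = model.generate(**batch, num_beams=4, max_new_tokens=64)  # beam search decoding
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```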

Why Transformers?

Other Model Choices


Data Preparation

Two data types feed the model:

  1. General data – Large-scale multilingual text from the internet.
  2. Translation data – ≈ 300 million sentence pairs (source + target).

Preparation focuses on translation data.

1. Primary Data: Parallel Corpora

This is the most crucial data, consisting of texts that are direct, sentence-by-sentence translations of each other. It forms the core training material for the model.

2. Data Expansion and Augmentation

Because high-quality parallel data is limited, several techniques are used to create more training examples:

1 · Text Pre-processing


Preprocessing Steps — Modern Translation Pretraining

🔎 1️⃣ Data Cleaning & Deduplication

| Task | Purpose |
| --- | --- |
| Remove noisy sentences | Drop sentence pairs with very long or very short sequences, or mismatched alignments. |
| Filter profanity / sensitive content | Ensure safe outputs. |
| Deduplicate | Remove duplicate sentence pairs and repeated monolingual data (web crawl has lots of duplication). |
| Script normalization | Normalize Unicode (NFC/NFD), convert different scripts consistently (e.g., Simplified ↔ Traditional Chinese). |

Why: Reduces noise → improves model generalization.
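A minimal sketch of these filters, assuming sentence pairs arrive as (source, target) tuples; the length bounds, ratio threshold, and profanity list are illustrative placeholders:

```python
import unicodedata

PROFANITY = {"badword"}  # placeholder word list

def clean_pairs(pairs, min_len=3, max_len=200, max_ratio=2.5):
    """Normalize, filter, and deduplicate (source, target) sentence pairs."""
    seen = set()
    for src, tgt in pairs:
        src = unicodedata.normalize("NFC", src).strip()
        tgt = unicodedata.normalize("NFC", tgt).strip()
        s_len, t_len = len(src.split()), len(tgt.split())
        if not (min_len <= s_len <= max_len and min_len <= t_len <= max_len):
            continue                                   # too short / too long
        if max(s_len, t_len) / max(min(s_len, t_len), 1) > max_ratio:
            continue                                   # suspicious alignment
        if PROFANITY & set(src.lower().split()):
            continue                                   # unsafe content
        if (src, tgt) in seen:
            continue                                   # exact duplicate
        seen.add((src, tgt))
        yield src, tgt
```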


🔎 2️⃣ Language Identification & Tagging

| Task | Purpose |
| --- | --- |
| Language detection | Auto-identify language of monolingual data (and confirm for parallel pairs). |
| Assign language tags | E.g., add `>>fr<<` or `>>zh<<` to source text so the model knows which language to translate into. |

Why: Enables multilingual pretraining and zero-shot/few-shot transfer.
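A sketch of target-language tagging in the OPUS/Marian style; `detect_language` is a hypothetical stand-in for a real language-ID model (e.g., a fastText classifier):

```python
def detect_language(text: str) -> str:
    """Placeholder for a real language-ID model (e.g., fastText lid.176)."""
    return "en"

def tag_for_translation(src_text: str, target_lang: str) -> str:
    # Prepend the target-language token so a single multilingual model
    # knows which language to translate into.
    assert detect_language(src_text) != target_lang, "source already in target language"
    return f">>{target_lang}<< {src_text}"

print(tag_for_translation("Good morning", "fr"))  # ">>fr<< Good morning"
```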


🔎 3️⃣ Tokenization

| Task | Purpose |
| --- | --- |
| Subword tokenization (SentencePiece, BPE, WordPiece) | Split words into common subword units to handle rare/unseen words. |
| Multilingual vocabulary | Train a shared tokenizer across all languages. |

Why: Handles vocabulary for hundreds of languages in a scalable way.
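A sketch of training one shared multilingual subword tokenizer with SentencePiece; the corpus path, vocabulary size, and coverage value are placeholders:

```python
import sentencepiece as spm

# Train a single shared subword vocabulary over text from all languages.
spm.SentencePieceTrainer.train(
    input="multilingual_corpus.txt",   # placeholder: one sentence per line, all languages mixed
    model_prefix="mt_spm",
    vocab_size=64000,
    model_type="unigram",              # or "bpe"
    character_coverage=0.9995,         # keep rare scripts covered
)

sp = spm.SentencePieceProcessor(model_file="mt_spm.model")
print(sp.encode("internationalization", out_type=str))
```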


🔎 4️⃣ Alignment Verification (Parallel Data)

| Task | Purpose |
| --- | --- |
| Length ratio checks | Filter out sentence pairs with extreme length mismatches. |
| Translation quality scoring (optional) | Use tools like LASER or BLEU filtering to keep only high-quality sentence pairs. |

Why: Ensures that paired data teaches the model correct alignments.
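A sketch of similarity-based pair filtering; `embed` stands in for a multilingual sentence encoder such as LASER and is hypothetical here, as is the cosine threshold:

```python
import numpy as np

def embed(sentences, lang):
    """Placeholder for a multilingual sentence encoder (e.g., LASER)."""
    raise NotImplementedError

def filter_by_similarity(pairs, src_lang, tgt_lang, threshold=0.75):
    srcs, tgts = zip(*pairs)
    src_vecs = embed(list(srcs), src_lang)
    tgt_vecs = embed(list(tgts), tgt_lang)
    kept = []
    for pair, s, t in zip(pairs, src_vecs, tgt_vecs):
        cos = float(np.dot(s, t) / (np.linalg.norm(s) * np.linalg.norm(t) + 1e-9))
        if cos >= threshold:          # keep only pairs that are semantically aligned
            kept.append(pair)
    return kept
```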


🔎 5️⃣ Corruption / Masking (for Monolingual)

| Task | Purpose |
| --- | --- |
| Noise injection (shuffling, masking, deletion) | Prepare monolingual data for Denoising Auto-Encoding (DAE) or MLM tasks. |

Why: Teaches encoder-decoder models to recover from corrupted inputs → boosts robustness and learning.
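A sketch of simple noise functions for denoising pre-training; the masking probability, deletion rate, and shuffle window are illustrative:

```python
import random

MASK = "<mask>"

def corrupt(tokens, mask_p=0.15, delete_p=0.05, shuffle_window=3):
    """Return a noised copy of `tokens` for denoising auto-encoding."""
    noised = []
    for tok in tokens:
        r = random.random()
        if r < delete_p:
            continue                                             # token deletion
        noised.append(MASK if r < delete_p + mask_p else tok)    # token masking
    # Local shuffling: permute tokens within small windows.
    for i in range(0, len(noised), shuffle_window):
        window = noised[i:i + shuffle_window]
        random.shuffle(window)
        noised[i:i + shuffle_window] = window
    return noised

print(corrupt("the quick brown fox jumps over the lazy dog".split()))
```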


🔎 6️⃣ Sampling & Balancing

| Task | Purpose |
| --- | --- |
| Upsampling low-resource languages | Increase frequency of rare languages in the training stream. |
| Downsampling high-resource languages | Prevent overfitting to dominant languages like English or Spanish. |

Why: Balances the multilingual training data.
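Balancing is commonly done with temperature-based sampling over per-language data sizes: raising the temperature flattens the distribution toward rare languages. A sketch, with purely illustrative sentence counts:

```python
def sampling_probs(sentence_counts, temperature=5.0):
    """Temperature-scaled sampling probability per language."""
    total = sum(sentence_counts.values())
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in sentence_counts.items()}
    z = sum(scaled.values())
    return {lang: p / z for lang, p in scaled.items()}

counts = {"en": 1_000_000_000, "fr": 100_000_000, "sw": 1_000_000}  # illustrative sizes
print(sampling_probs(counts))  # rare languages get a far larger share than their raw counts
```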


🔎 7️⃣ Back-Translation (for Monolingual → Synthetic Parallel)

| Task | Purpose |
| --- | --- |
| Translate monolingual target sentences back into source languages | Create synthetic parallel data when human-translated data is missing. |

Why: Expands training data for low-resource pairs.
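A sketch of back-translation with a reverse-direction model; the public OPUS-MT French-to-English checkpoint is used only for illustration:

```python
from transformers import MarianMTModel, MarianTokenizer

# Reverse-direction model: translates target-language (French) monolingual
# text back into the source language (English) to build synthetic pairs.
name = "Helsinki-NLP/opus-mt-fr-en"
tok = MarianTokenizer.from_pretrained(name)
rev = MarianMTModel.from_pretrained(name)

def back_translate(target_sentences):
    batch = tok(target_sentences, return_tensors="pt", padding=True, truncation=True)
    out = rev.generate(**batch, num_beams=4, max_new_tokens=128)
    synthetic_sources = tok.batch_decode(out, skip_special_tokens=True)
    # Synthetic (source, target) pairs: noisy machine-generated source, clean human-written target.
    return list(zip(synthetic_sources, target_sentences))

print(back_translate(["Le chat dort sur le canapé."]))
```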


2 · Text Tokenization

Word-level vocabularies explode in size across 130+ languages, so we use sub-word tokenization, specifically Byte-Pair Encoding (BPE).

Example token-to-ID mapping:

| Token | ID |
| --- | --- |
| `<BOS>` | 0 |
| `<EOS>` | 1 |
| walking | 2 |
| bonjour | 3 |
| hello | 4 |
| fantastique | 5 |


1. Byte-Pair Encoding (BPE)

Used in: GPT-2, RoBERTa

Example:
Training corpus: "low lower newest widest"

  1. Start with chars: l o w, l o w e r, etc.
  2. Find the most common pair (e.g., 'e' + 'r' → 'er') and merge it into a new symbol 'er'.
  3. Repeat until vocab size is met.

Tokenize "lower" → ['low', 'er'] if 'low' and 'er' exist in the vocab.
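A toy sketch of learning BPE merges on a corpus like the one above, simplified from the original Sennrich et al. algorithm (no boundary-aware regex, just naive string replacement):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the (word -> frequency) vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge one symbol pair everywhere it occurs."""
    joined, merged = " ".join(pair), "".join(pair)
    return {word.replace(joined, merged): f for word, f in vocab.items()}

# Words as space-separated characters with an end-of-word marker.
vocab = {"l o w </w>": 1, "l o w e r </w>": 1, "n e w e s t </w>": 1, "w i d e s t </w>": 1}
for _ in range(5):
    best = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
    print("merge:", best)
```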


2. Byte-Level BPE (BBPE)

Used in: GPT-3, GPT-4

Example:
Input: "apple 🍎"

  1. Break into raw bytes: 'a', 'p', 'p', 'l', 'e', ' ', plus the four UTF-8 bytes of '🍎'.
  2. Apply merges on bytes to form subwords: ['app', 'le', ' ', '🍎'] (assuming merges for 'app', 'le', and the emoji's bytes exist).

3. Unigram Language Model

Used in: SentencePiece (T5, ALBERT)

Example: "international"
May be tokenized as ['inter', 'national'] or ['intern', 'ation', 'al'] depending on highest probability.


4. WordPiece

Used in: BERT

Example: "unhappiness"
['un', '##happiness'] or ['un', '##hap', '##pi', '##ness'] depending on vocab.


Summary Table:

| Algorithm | Used In | Key Feature | Handles OOV? | Probabilistic? |
| --- | --- | --- | --- | --- |
| BPE | GPT-2, RoBERTa | Merges frequent char pairs | Yes | No |
| Byte-level BPE | GPT-3, GPT-4 | Merges on byte sequences | Yes | No |
| WordPiece | BERT | Greedy merging with LM scoring | Yes | No |
| Unigram LM | T5, ALBERT | Picks best subwords probabilistically | Yes | Yes |



Model Development

Architecture Overview

Encoder-decoder Transformer components:

| Component | Encoder | Decoder |
| --- | --- | --- |
| Token Embedding | Yes | Yes |
| Positional Encoding | Yes | Yes |
| Self-Attention | Full (bi-directional) | Masked (causal) |
| Cross-Attention | No | Yes (attends to encoder outputs) |
| Prediction Head | No | Yes (Linear + softmax) |

Key differences:

  1. Cross-Attention Layer – Decoder attends to encoder outputs.
  2. Masked Self-Attention – Decoder can’t see future tokens.
  3. Prediction Head – Linear + softmax layer produces token probabilities (see the sketch below).
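A minimal sketch of these components wired together with PyTorch's built-in nn.Transformer; all dimensions are illustrative and far smaller than a production translation model:

```python
import torch
import torch.nn as nn

class TranslationModel(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, nhead=8, num_layers=6, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)      # token embedding
        self.pos_emb = nn.Embedding(max_len, d_model)         # learned positional encoding
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)         # prediction head (softmax lives in the loss)

    def embed(self, ids):
        pos = torch.arange(ids.size(1), device=ids.device)
        return self.tok_emb(ids) + self.pos_emb(pos)

    def forward(self, src_ids, tgt_ids):
        # Causal mask so the decoder cannot attend to future target tokens.
        causal = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(self.embed(src_ids), self.embed(tgt_ids), tgt_mask=causal)
        return self.lm_head(hidden)                            # token logits

logits = TranslationModel()(torch.randint(0, 32000, (2, 10)), torch.randint(0, 32000, (2, 12)))
print(logits.shape)  # (batch, target_len, vocab_size)
```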

Other key information:

Training Strategy by Alex

| Stage | Data | Objective | Loss |
| --- | --- | --- | --- |
| Unsupervised Pre-training | Multilingual general text | Masked Language Modeling (MLM) | Cross-entropy |
| Supervised Finetuning | Parallel sentence pairs (bilingual) | Next-token prediction | Cross-entropy |

ML Objective & Loss Function

This section details the ML objective and loss function for each stage, organised to show why each design choice matters when training a Neural Machine Translation (NMT) system.


1 | Pre-training: Learning Multilingual Representations

Data
We keep the multilingual portions of large-scale corpora (C4, Wikipedia, Stack Exchange, Common-Crawl, etc.) and drop only the languages we never intend to translate. This maximises coverage while avoiding label noise from irrelevant scripts.

Objective — Masked Language Modeling (MLM) with span corruption:

```
Input  : The quick brown <X> jumps over the <Y> dog .
Target : <X> fox <Y> lazy
```

Loss — Cross-entropy over only the masked positions, with optional label smoothing (ε ≈ 0.1) for better generalisation.
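Concretely, for masked positions M, vocabulary V, corrupted input x̃, and label-smoothing weight ε, a standard formulation (not quoted from any specific Google paper) is:

```latex
\mathcal{L}_{\text{MLM}}
  = -\frac{1}{|M|}\sum_{i \in M}\sum_{v \in V} \tilde{y}_{i,v}\,\log p_\theta\!\left(x_i = v \mid \tilde{x}\right),
\qquad
\tilde{y}_{i,v} = (1-\varepsilon)\,\mathbf{1}[v = y_i] + \frac{\varepsilon}{|V|}
```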


2 | Supervised Fine-tuning: Turning a General LM into a Translator

With sentence-aligned pairs (x, y):

  1. Encoder ingests the full source sentence x.
  2. Decoder is trained with teacher forcing: it receives the gold target prefix y_{<t} and predicts the next token y_t.
  3. Loss — Token-level cross-entropy on all target positions, plus the same label-smoothing trick.
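Under teacher forcing, this is the standard sequence cross-entropy:

```latex
\mathcal{L}_{\text{MT}} = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, x\right)
```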


Training Strategy for Modern Multilingual NMT Systems

Modern multilingual Neural Machine Translation (NMT) models, such as those used in services like Google Translate, follow a multi-stage training process to achieve high performance across many language pairs. The pipeline typically includes the following stages:


1. Pretraining (Cross-Lingual Language Modeling)

Purpose:
Learn universal language patterns and cross-lingual representations before any translation-specific training.

Data Used:
Large-scale monolingual corpora from many languages, often mined from web sources.

Training Objectives:

Model Behavior:
The model learns general linguistic features across languages, which helps especially in low-resource scenarios.


2. Supervised Fine-Tuning (Translation Modeling)

Purpose:
Teach the model how to translate between languages using sentence-aligned parallel corpora.

Data Used:
Large-scale parallel corpora, mostly English-centric but also covering non-English pairs. Synthetic data (e.g., back-translated text) is also used to supplement low-resource languages.

Training Strategy:

Augmentation Techniques:

Zero-shot Capability:
A well-trained multilingual model can often translate between unseen language pairs by transferring knowledge from related pairs.


3. Domain Adaptation (Specialized Fine-Tuning)

Purpose:
Improve translation quality in specific domains like healthcare, law, or conversational text.

Data Used:
Smaller, domain-specific parallel corpora, sometimes supplemented by synthetic domain data via back-translation.

Training Strategy:



End-to-End Summary

Multilingual NMT systems are trained in stages—starting with monolingual self-supervised learning, followed by supervised translation training using real and synthetic parallel data, and optionally domain-specific fine-tuning. The training pipeline is carefully designed to handle large-scale, multilingual data while balancing quality across languages and domains. Ongoing advances in architecture, data quality, and LLM integration continue to push translation quality and coverage.


2. Training Data Collection

A. Parallel Corpora (Supervised)

B. Monolingual Corpora (Unsupervised/Pretraining)

C. Quality Filtering Techniques

D. Data Balancing


Sampling Strategies in Generative Models

  1. Deterministic methods (e.g., greedy search, beam search)
  2. Stochastic methods (e.g., multinomial sampling, top-k, top-p)

In this chapter, we choose beam search for two key reasons:

In contrast, stochastic sampling methods are better suited for tasks where diversity and creativity are desired, such as storytelling or dialogue generation.
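A minimal beam-search sketch over an arbitrary next-token scorer; `log_probs_fn`, the special token ids, and the beam size are placeholders rather than a production decoder:

```python
def beam_search(log_probs_fn, bos_id, eos_id, beam_size=4, max_len=50):
    """log_probs_fn(prefix) -> {token_id: log_prob} for the next token."""
    beams = [([bos_id], 0.0)]                       # (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                   # finished hypotheses carry over unchanged
                candidates.append((seq, score))
                continue
            for tok, lp in log_probs_fn(seq).items():
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos_id for seq, _ in beams):
            break
    return beams[0][0]                              # highest-scoring hypothesis
```

In practice the cumulative score is usually length-normalised so longer translations are not unfairly penalised.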


Comparison: Deterministic vs. Stochastic Sampling

| Characteristic | Deterministic Methods | Stochastic Methods |
| --- | --- | --- |
| Approach | Follows a predictable decision process | Samples from a probability distribution |
| Efficiency | Less efficient (tracks multiple paths) | More efficient due to randomness |
| Output Quality | Coherent and predictable | Diverse and imaginative |
| Risk | May produce repetitive sequences | May generate inappropriate or off-topic text |
| Best For | Tasks requiring consistency (e.g. translation) | Tasks requiring creativity (e.g. story generation) |
| Common Methods | Greedy search, beam search | Multinomial sampling, top-k, top-p sampling |


Evaluation

Offline Metrics

Key Metrics:

Human Evaluation

LLM as a Judge

The typical setup involves prompting an LLM with the source sentence, the candidate translation(s), and optionally a reference translation. The LLM is then instructed (via a carefully designed prompt) to perform tasks such as:

LLMs can handle multilingual input and adapt to complex linguistic nuances, making them especially valuable for evaluating low-resource or morphologically rich languages where human evaluation is expensive or infeasible.

Companies often use few-shot or zero-shot prompting, and sometimes fine-tune LLMs with supervised data (e.g., human-labeled translation comparisons) to improve consistency and alignment with human judgments. These LLM-based evaluations correlate more closely with human ratings than BLEU scores and offer significant efficiency and scalability benefits.

For example:
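One illustrative judging prompt might look like the sketch below; the wording, score scale, and JSON format are invented for this post, not a production prompt:

```python
JUDGE_PROMPT = """You are a professional translator and evaluator.
Source ({src_lang}): {source}
Candidate translation ({tgt_lang}): {candidate}
Reference translation (optional): {reference}

Rate the candidate from 1 (unusable) to 5 (perfect) for adequacy and fluency,
then briefly justify the scores.
Respond as JSON: {{"adequacy": _, "fluency": _, "rationale": _}}."""

print(JUDGE_PROMPT.format(src_lang="en", tgt_lang="fr",
                          source="The meeting was postponed.",
                          candidate="La réunion a été reportée.",
                          reference="La réunion a été reportée."))
```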

Online Metrics


Overall ML System Design

1 · Language Detector

An encoder-only Transformer classifies the input language via a classification head on top of the encoded input.

2 · Translation Service

  1. Detect language.
  2. Route to correct bilingual model.
  3. Perform beam search decoding.
  4. Detokenize and return text (see the sketch below).
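Putting the pieces together, a sketch of the request path; `detect_language`, `MODELS`, and the tokenizer are hypothetical stand-ins for the deployed components:

```python
def detect_language(text: str) -> str:
    """Placeholder for the encoder-only language classifier."""
    return "en"

MODELS = {}  # (src_lang, tgt_lang) -> (model, tokenizer), loaded at service startup

def translate(text: str, target_lang: str) -> str:
    src_lang = detect_language(text)                        # 1. detect language
    model, tokenizer = MODELS[(src_lang, target_lang)]      # 2. route to the right translation model
    batch = tokenizer([text], return_tensors="pt", padding=True)
    output_ids = model.generate(**batch, num_beams=4)       # 3. beam search decoding
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]  # 4. detokenize
```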

Other Talking Points

If time remains, discuss:


Real-Time Translation:

Replace Full-Sequence Encoder-Decoder with Streaming Models:

Architectural Options:

| Strategy | Description | Example |
| --- | --- | --- |
| Chunk-based Transformer | Break input into fixed-size overlapping windows | SimulTrans, STACL |
| Monotonic Attention | Enforces left-to-right or wait-k policy | MoChA, Wait-k Transformer |
| Transducer models | Encoder + prediction network + joiner (non-attention based) | RNN-T, Recurrent Neural Aligner |
| Streaming Conformer (for speech) | Use causal attention and limited context for audio | Google's real-time ST model |

🛠 Techniques to Apply:


On-Device Translation:

Replace Heavy Transformer with Lightweight, Quantized, and Distilled Models:

Architectural Options:

| Model | Type | Notes |
| --- | --- | --- |
| MobileBERT, TinyBERT | Lightweight Transformers | Good encoder for on-device NMT |
| DistilBART, DistilmBART | Distilled encoder-decoder | Smaller decoder with similar quality |
| mBART + pruning/quantization | Multilingual | Use only specific heads/layers |
| FNet | Replaces self-attn with Fourier Transform | Super fast & small |
| Linear Attention Models (Performer, Linformer) | Approximate attention for low-memory devices | Good for constrained decoding |

🛠 Key Optimization Techniques:


