Analyzing Contextualized Embeddings in BERT and other Transformer Models

Timur
5 min read · Nov 2, 2020

This is my reading note on the EMNLP 2020 long paper “Assessing Phrasal Representation and Composition in Transformers” (Yu and Ettinger, 2020).

Part 2 of the note can be found here.

Short Summary

The paper analyzes two-word phrase representation and composition in a variety of pre-trained state-of-the-art transformer models. The general idea of the analysis is to evaluate contextualized embeddings by how well they align with human judgments of phrase-pair similarity.

The main conclusion is that contextualized embeddings rely heavily on lexical content and show little evidence of sophisticated meaning composition.

Composition

To start with, what is “composition” in neural language models? It is a fundamental component of language understanding: a model’s capacity to combine smaller meaning units into larger ones. The paper makes the further assumption that a composed representation should specifically resemble the output of the human compositional process.

Example of composition. The composition process combines the embeddings of individual words into an embedding of the phrase “law school”.
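To make this concrete, here is a minimal sketch (my own illustration, not the paper’s code) that builds such a phrase embedding by element-wise averaging of BERT’s contextualized token vectors with Hugging Face transformers; the model choice, the layer choice, and the handling of special tokens are assumptions made purely for illustration.

```python
# A minimal sketch (not the paper's exact code): average BERT's contextualized
# token vectors to form a phrase embedding for "law school".
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def phrase_vector(phrase: str, layer: int = -1) -> torch.Tensor:
    """Element-wise average of the contextualized token vectors of `phrase`."""
    enc = tokenizer(phrase, return_tensors="pt", return_special_tokens_mask=True)
    special = enc.pop("special_tokens_mask")[0].bool()    # marks [CLS]/[SEP]
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]     # (seq_len, hidden_dim)
    return hidden[~special].mean(dim=0)                   # phrase-token average

vec = phrase_vector("law school")
print(vec.shape)   # torch.Size([768]) for bert-base
```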

Task Setup

The analysis focuses on two-word phrases for the following reasons:

  • They are the smallest phrasal units
  • They are the most conducive to lexical controls
  • They allow leveraging larger amounts of annotated phrase-similarity data

The key motivation of the analysis is that composition effects should be separated from lexical encoding. Specifically, sophisticated composition should capture meaning beyond the lexical content of the phrase.

Example of composition

A good compositional model should produce different embeddings for the phrase “law school” versus “school law”. This comes back to the assumption that even with identical input embeddings for the words “law” and “school”, the composed representation should correspond to the output of the human composition process: the embedding of “law school” should reflect the meaning of “law school”, and likewise for “school law”.
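Continuing the sketch above (and reusing its phrase_vector helper), a quick way to probe this is to compare the vectors the model assigns to the two orderings:

```python
# Continuing the sketch above (reusing phrase_vector): a genuinely compositional
# representation should distinguish the two orderings, not just their shared words.
import torch.nn.functional as F

law_school = phrase_vector("law school")
school_law = phrase_vector("school law")
sim = F.cosine_similarity(law_school.unsqueeze(0), school_law.unsqueeze(0)).item()
print(f"cos(law school, school law) = {sim:.3f}")
# A very high cosine would suggest the vectors are driven mostly by lexical
# content rather than by a composed phrase meaning.
```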

The tasks used in the paper are:

  • Similarity correlation
  • Paraphrase classification
  • Landmark experiment (qualitative analysis)

These tasks aim to capture the correspondence between phrase representations and human judgment. In addition to the normal tests (on the complete datasets), the paper adds controlled tests that remove cues from lexical content.

Similarity Correlation

This test correlates cosine similarities between phrase representations with human-annotated similarity ratings from the bigram relatedness dataset BiRD (Asaadi et al., 2019). Each group in BiRD contains a source phrase paired with multiple target phrases, along with human-annotated scores of how similar each pair is. The authors correlate the cosine similarities of the paired phrase representations with these ratings.

Normal correlation examples in the paper

However, the possibility remains that the model infers phrase similarity from cues such as word overlap or word content alone. When evaluating representations, the goal is to tease apart lexical effects from the composition process.

A controlled experiment is therefore designed to remove the effect of word overlap: only pairs of the form “AB–BA” (the same two words in reversed order) are kept, so models cannot infer similarity scores from word overlap.

Controlled correlation examples in the paper
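As an illustration of the test (not the paper’s exact pipeline), here is a sketch that reuses the phrase_vector helper from the earlier sketch; the three example rows are invented to show the data shape rather than real BiRD annotations, and the choice of Spearman correlation is my assumption.

```python
# Illustrative correlation test (not the paper's exact pipeline), reusing
# phrase_vector from the earlier sketch. The three rows below are invented to
# show the data shape; they are not real BiRD annotations.
import torch.nn.functional as F
from scipy.stats import spearmanr

bird_like = [
    ("law school", "law degree",  0.71),   # (source, target, human rating)
    ("law school", "school law",  0.39),
    ("law school", "traffic law", 0.24),
]

model_sims, human_ratings = [], []
for source, target, rating in bird_like:
    s, t = phrase_vector(source), phrase_vector(target)
    model_sims.append(F.cosine_similarity(s.unsqueeze(0), t.unsqueeze(0)).item())
    human_ratings.append(rating)

rho, _ = spearmanr(model_sims, human_ratings)
print(f"correlation with human ratings: {rho:.3f}")

# Controlled version: keep only AB-BA pairs, i.e. the target is the reversed source.
controlled = [(s, t, r) for s, t, r in bird_like if s.split()[::-1] == t.split()]
```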

Paraphrase Classification

This test trains an MLP classifier to identify whether phrase pairs are paraphrases or non-paraphrases. It inspects whether representations align better with human judgment when operations more complex than cosine similarity are allowed.

Positive pairs are extracted from PPDB 2.0, the Paraphrase Database; non-paraphrases are collected by randomly sampling from the rest of the dataset.

Normal classification examples

As in the correlation test, high classification accuracy might result from cues of lexical content, so the controlled test removes this superficial information: filtered phrase pairs have exactly 50% word overlap, and classifiers are trained and tested on this controlled dataset.
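A minimal sketch of the classification setup with scikit-learn; the concatenated pair features, the classifier hyperparameters, and the stand-in random data are my assumptions, there only to show the shape of the experiment.

```python
# Illustrative paraphrase classification (not the paper's exact configuration).
# The random arrays below are stand-ins: in the real setup, each row would be
# built from the two phrase vectors of a PPDB positive or a sampled negative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
dim = 768
vec_a = rng.normal(size=(1000, dim))
vec_b = rng.normal(size=(1000, dim))
X = np.concatenate([vec_a, vec_b], axis=1)    # concatenated pair features (assumption)
y = rng.integers(0, 2, size=1000)             # 1 = paraphrase, 0 = non-paraphrase

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200, random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```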

Transformers & Representations

Transformers analyzed in the paper include:

  • BERT (Devlin et al., 2019)
  • RoBERTa (Liu et al., 2019)
  • DistilBERT (Sanh et al., 2019)
  • XLNet (Yang et al., 2019)
  • XLM-RoBERTa (Conneau et al., 2019)

Since transformers maintain representations for every token at every layer, there is no single obvious aggregate representation of a phrase. The paper therefore tests a variety of representation types and measures performance layer by layer.

Representations examined in the paper include (a short extraction sketch follows the list):

  • Avg-Phrase: element-wise averaging of the phrase token representations
  • Head-Word: the representation of the last word of the phrase. The head word is expected to express the central meaning of the phrase and could potentially stand in for the whole phrase. For instance, in the phrase “public service”, “public” modifies the head word “service”.
  • Avg-All: element-wise averaging of all input token representations
  • CLS: model-dependent; the representation of the first special token ([CLS]) for BERT
  • SEP: model-dependent; generally used to mark sentence boundaries in the input sequence
Representation types
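Here is a hedged sketch of how these representation types could be read off a model’s hidden states with Hugging Face transformers; the layer index and the token bookkeeping are my assumptions, and the head word is assumed to be a single WordPiece.

```python
# Illustrative extraction of the five representation types from BERT's hidden
# states (not the paper's exact code), for a phrase-only input "[CLS] law school [SEP]".
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

enc = tokenizer("law school", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**enc).hidden_states   # tuple: embedding layer + 12 layers

layer = 12                                       # layer to inspect (layer-wise in the paper)
h = hidden_states[layer][0]                      # (seq_len, hidden_dim)
phrase = h[1:-1]                                 # drop [CLS] and [SEP]

representations = {
    "Avg-Phrase": phrase.mean(dim=0),            # average over phrase tokens only
    "Head-Word":  phrase[-1],                    # last word of the phrase
    "Avg-All":    h.mean(dim=0),                 # average over all input tokens
    "CLS":        h[0],                          # first special token
    "SEP":        h[-1],                         # final special token
}
for name, vec in representations.items():
    print(name, tuple(vec.shape))
```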

For both the correlation and classification tests, the paper experiments with phrase-only input and context-available input. In the context-available setting, phrases are embedded in context sentences extracted from a Wikipedia dump; in the phrase-only setting, only the phrase and special tokens are passed as input.
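For the context-available setting, the bookkeeping amounts to locating the phrase’s token positions inside the sentence and pooling only those; a rough continuation of the sketch above (the example sentence is mine, not from the paper):

```python
# Continuing the sketch above (same tokenizer and model): context-available input.
# Embed the phrase in a sentence and average only the hidden states at the
# phrase's token positions. The sentence is made up; the paper draws contexts
# from a Wikipedia dump.
sentence = "She applied to law school after finishing her degree."
phrase_text = "law school"

enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0]                # character span of each token
start = sentence.index(phrase_text)
end = start + len(phrase_text)
phrase_idx = [i for i, (s, e) in enumerate(offsets.tolist())
              if s >= start and e <= end and e > s]   # tokens inside the phrase span

with torch.no_grad():
    h = model(**enc).hidden_states[-1][0]             # last layer, (seq_len, hidden_dim)
phrase_in_context = h[phrase_idx].mean(dim=0)         # Avg-Phrase, context-available
print(phrase_in_context.shape)                        # torch.Size([768])
```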

Due to the length of this note, I will cover the experimental results and conclusions of the paper in part two.
