Analyzing Contextualized Embeddings in BERT and other Transformer Models (Pt 2)

Timur
Nov 4, 2020

This is my second reading note on the EMNLP 2020 long paper “Assessing Phrasal Representation and Composition in Transformers”.

Part 1 can be found here; it explains the task setup, motivation, and other details of the paper.

Short Summary

The paper analyzes two-word phrase representation and composition in a variety of pre-trained state-of-the-art transformer models. The general idea of the analysis is to evaluate contextualized embeddings by their alignment with human judgments of phrase-pair similarity.
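The evaluation recipe can be sketched in a few lines: score each phrase pair by the cosine similarity of its two embeddings, then measure the rank correlation of those scores with the human similarity ratings. Below is a minimal sketch of that pipeline; the vectors and ratings are made-up stand-ins, not data from the paper.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(x, y):
    # Spearman rho = Pearson correlation of the rank-transformed values
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

# Toy phrase-pair embeddings (stand-ins for transformer outputs)
rng = np.random.default_rng(0)
pairs = [(rng.normal(size=8), rng.normal(size=8)) for _ in range(5)]

model_sims = [cosine(a, b) for a, b in pairs]
human_ratings = [3.1, 5.8, 1.2, 4.4, 2.0]  # hypothetical annotator scores

rho = spearman(model_sims, human_ratings)
print(round(rho, 3))
```

A high rho means the model's similarity scores order the phrase pairs the same way human annotators do.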

The conclusion is that contextualized embeddings rely heavily on lexical content and miss sophisticated meaning composition.

Experiment Results

Similarity Correlation

Correlation result with phrase-only input.

Again, the paper measures 5 different representation types. The first row shows correlation results on the full BiRD dataset for all models, layers, and representation types with phrase-only inputs. The X-axis shows layer indices, and the Y-axis shows the correlation value. Among representation types, Avg-Phrase and Avg-All consistently achieve the highest correlations across models and layers. Layer-wise, Avg-All and Avg-Phrase peak at layer 1 except in DistilBERT, whereas CLS requires more layers to peak. Model-wise, XLM-RoBERTa is the weakest, potentially because it is trained to handle multiple languages. BERT retains fairly consistent correlations across layers, while RoBERTa and XLNet decline rapidly as layers progress.
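Concretely, representation types like CLS, Avg-Phrase, and Avg-All are different poolings over one layer's token vectors. A toy sketch of the three poolings, with random vectors standing in for real hidden states (the token layout and phrase span here are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend hidden states for "[CLS] heavy rain [SEP]" at one layer:
# shape (num_tokens, hidden_dim)
hidden = rng.normal(size=(4, 8))
phrase_span = slice(1, 3)  # token positions of "heavy rain"

cls_rep = hidden[0]                       # CLS: the first special token's vector
avg_phrase = hidden[phrase_span].mean(0)  # Avg-Phrase: mean over phrase tokens only
avg_all = hidden.mean(0)                  # Avg-All: mean over all tokens, incl. specials

print(cls_rep.shape, avg_phrase.shape, avg_all.shape)
```

In practice the hidden states would come from a real model's per-layer outputs; the pooling arithmetic is the same.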

The second row shows the models’ performance on the controlled dataset. Note that the overall correlation is very low compared to performance on the full dataset, and that the Y-axis is on a different scale. With the AB-BA test, the paper examines the extent to which the above correlations reflect sophisticated phrasal composition versus effective encoding of information about the phrases’ component words. The performance of all models drops significantly. Avg-All and Avg-Phrase no longer dominate, suggesting that these representations capture word information but not higher-level compositional information. Notably, the CLS tokens in RoBERTa and DistilBERT show relatively strong correlations in later layers, suggesting some correspondence to the compositional signal.
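The intuition behind the AB-BA control is easiest to see with a bag-of-words baseline: any pooling that simply averages word vectors assigns identical representations to "AB" and "BA", so it cannot distinguish phrases that differ only in word order. A minimal illustration with made-up word vectors (contextualized models are not literally bags of words, but the controlled results suggest their phrase vectors behave similarly):

```python
import numpy as np

rng = np.random.default_rng(2)
word_vec = {"law": rng.normal(size=8), "school": rng.normal(size=8)}

def avg_embed(phrase):
    # Order-insensitive pooling: mean of the phrase's word vectors
    return np.mean([word_vec[w] for w in phrase.split()], axis=0)

ab = avg_embed("law school")
ba = avg_embed("school law")
print(np.allclose(ab, ba))  # True: averaging discards word order
```

Any representation with this property scores at chance on AB-BA pairs, which is why the control isolates compositional (order-sensitive) information.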

Correlation on BiRD dataset with phrases embedded in sentence context (context-available input setting).

The table above shows the correlations when phrases are embedded in sentences. On the full dataset, Avg-Phrase is now consistently the highest in correlation, and the correlation no longer drops dramatically in later layers. In the controlled setting, the presence of context does boost the overall correlation. However, correlation still degrades significantly compared to the full dataset. This indicates that even with context available, phrase representations in transformers rely heavily on word content.

Paraphrase Classification

Classification accuracy on PPDB dataset (phrase-only input setting).

Moving on to the results from paraphrase classification: the first row shows results on the full paraphrase classification dataset, where the Y-axis corresponds to classification accuracy. Accuracies are very high overall, and the results show patterns generally similar to the correlation tasks. The best accuracy is achieved with the Avg-Phrase and Avg-All representations, while CLS requires a few more layers to peak. RoBERTa, XLM-RoBERTa, and XLNet show decreasing accuracies in later layers, while BERT and DistilBERT remain more consistent across layers.

Classification accuracy on PPDB dataset with phrases embedded in sentence context.

However, the classification accuracies are also inflated by word-overlap cues. The bottom row shows classification accuracy when word overlap is held constant. Consistent with the controlled correlation experiments, classification performance of all models drops to only slightly above chance. This suggests that the high classification performance on the full dataset relies largely on word-overlap information, and that little higher-level phrase-meaning information remains to aid classification once the overlap cue is removed.
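The inflation from word overlap can be illustrated with a trivial overlap feature: on uncontrolled data, lexical overlap alone can separate paraphrases from non-paraphrases, but once overlap is held constant across classes the feature carries no signal. A toy sketch, with invented phrase pairs for illustration:

```python
def word_overlap(p1, p2):
    # Jaccard overlap between the two phrases' word sets
    s1, s2 = set(p1.split()), set(p2.split())
    return len(s1 & s2) / len(s1 | s2)

# Uncontrolled: paraphrases tend to share words, non-paraphrases don't
para = word_overlap("heavy rain", "heavy rainfall")
non_para = word_overlap("heavy rain", "bright light")
print(para > non_para)  # True: overlap alone is a usable cue

# Controlled: overlap held constant (one shared word in each pair),
# so the cue no longer distinguishes the classes
ctrl_para = word_overlap("heavy rain", "heavy downpour")
ctrl_non = word_overlap("heavy rain", "heavy book")
print(ctrl_para == ctrl_non)  # True: a classifier must now rely on meaning
```

When a classifier's accuracy collapses under this control, the earlier accuracy was largely attributable to the overlap feature rather than to composed phrase meaning.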

Landmark Experiment

An example of the landmark experiment for the verb “run”.

The landmark experiment aims at assessing phrasal composition: testing models’ ability to select the correct sense of a polysemous verb in a composed phrase, as proposed by Kintsch (2001). Each test item consists of (a) a central verb, (b) two subject-verb phrases that pick out different senses of the verb, and (c) two landmark words, each associated with one of the target senses of the verb. The reasoning is that composition should select the correct verb meaning, shifting representations of the central verb, and of the phrase as a whole, toward the landmark with the closer meaning.
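A schematic version of the landmark test, with hand-built vectors (the sense and landmark directions are invented purely for illustration): the ambiguous verb vector mixes two sense directions, and composing it with a disambiguating subject should move the phrase representation toward the landmark of the selected sense and away from the other.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two orthogonal "sense" directions for the verb "run"
sense_fast = np.array([1.0, 0.0])  # e.g. the "move quickly" sense
sense_flow = np.array([0.0, 1.0])  # e.g. the "color runs" sense

run = 0.5 * sense_fast + 0.5 * sense_flow  # ambiguous verb: mix of both senses
horse = sense_fast                          # subject cueing the "fast" sense
gallop = sense_fast                         # landmark for the correct sense
dissolve = sense_flow                       # landmark for the wrong sense

phrase = (horse + run) / 2  # naive composition: average subject and verb

# Composition should shift the representation toward the correct landmark
shift_correct = cosine(phrase, gallop) - cosine(run, gallop)
shift_wrong = cosine(phrase, dissolve) - cosine(run, dissolve)
print(shift_correct > 0 and shift_wrong < 0)  # True
```

The paper's version measures such shifts on real model representations; here even naive averaging passes because the subject vector happens to point along the correct sense, which is exactly why success on this test can reflect lexical content rather than genuine composition.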

Landmark experiments. Y-axis denotes the percentage of samples that are shifted towards the correct landmark words in each layer.

The general observation is that the results largely parallel those from the uncontrolled versions of the correlation and classification analyses, suggesting that success on this landmark test may reflect lexical properties of the representations more than sophisticated composition.

Conclusions

  • Across all models, there is non-trivial alignment with human judgments, but it appears to rely on lexical information
  • With lexical overlap controlled, the experiments show severe performance drops in both similarity correlation and paraphrase classification
  • The models show little sophisticated phrase composition beyond encoding word content
