Long Range Arena: A Benchmark for Efficient Transformers – Summary and Comments

Joel Niklaus
4 min read · Oct 29, 2021

By Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, Donald Metzler

Standard transformers struggle with long inputs because of their quadratic self-attention module. That’s why many “efficient” transformers have been proposed recently. However, there is no unified way of comparing them, since each is evaluated on different data. With the Long Range Arena (LRA) benchmark, the authors want to change this and create a unified, systematic, common benchmark for this class of efficient transformers. They propose a suite of six tasks with sequence lengths ranging from 1K to 16K tokens, drawn from a wide range of data types and modalities.
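To make the quadratic bottleneck concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) of vanilla self-attention. The full seq_len × seq_len score matrix is exactly what the efficient variants try to avoid materializing:

```python
import numpy as np

# Toy illustration of why vanilla self-attention is quadratic in sequence length.
# Shapes and names are illustrative, not taken from any particular implementation.
seq_len, d_model = 4096, 64
Q = np.random.randn(seq_len, d_model)
K = np.random.randn(seq_len, d_model)
V = np.random.randn(seq_len, d_model)

scores = Q @ K.T / np.sqrt(d_model)              # (seq_len, seq_len) -> a 4096 x 4096 matrix
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
output = weights @ V                             # (seq_len, d_model)

# Memory for `scores` alone grows with seq_len**2: at 16K tokens this single
# matrix already has 16384 * 16384 ≈ 268M entries.
print(scores.shape)  # (4096, 4096)
```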

They follow these guiding principles in constructing the benchmark:

  1. Generality: All efficient Transformer models should be applicable to our tasks. For instance, given that not all xformer models are able to perform autoregressive decoding, we include tasks that only require encoding.
  2. Simplicity: The tasks should have a simple setup. All factors that make comparisons difficult should be removed. This encourages simple models instead of cumbersome pipelined approaches. For instance, we avoid including any particular data augmentation and consider pretraining to be out of scope of this benchmark.
  3. Challenging: The tasks should be difficult enough for current models to ensure there is room for improvement to encourage future research in this direction.
  4. Long inputs: The input sequence lengths should be reasonably long since assessing how different models capture long-range dependencies is a core focus of LRA.
  5. Probing diverse aspects: The set of tasks should assess different capabilities of models like their ability to model relations and hierarchical/spatial structures, generalization capability, etc.
  6. Non-resource intensive and accessible: The benchmarks should be deliberately designed to be lightweight so as to be accessible to researchers without industry-grade computing resources.

I think these guidelines make a lot of sense. I would maybe add a seventh one:
7. Realistic: The tasks should be close to actual real-world use cases to ensure the transferability of the results to realistic deployment scenarios.

This already brings me to my main point of critique of the paper, but I am getting ahead of myself. Let me first introduce the tasks:

Long ListOps

Here they are interested in “hierarchically structured data in a long-context scenario”. The input lengths range up to 2K tokens.
They give a nicely illustrative example:

INPUT: [MAX 4 3 [MIN 2 3 ] 1 0 [MEDIAN 1 5 8 9 2]] OUTPUT: 5
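To make the task concrete, here is a minimal sketch (my own, not the official data-generation code) of how such an expression is evaluated; the full task also uses additional operators beyond the three needed for the example above:

```python
from statistics import median

# Operators appearing in the example; the real ListOps task includes more.
OPS = {
    "MAX": max,
    "MIN": min,
    "MEDIAN": lambda xs: int(median(xs)),
}

def evaluate(tokens):
    """Recursively evaluate a tokenized ListOps expression."""
    token = tokens.pop(0)
    if token.startswith("["):             # e.g. "[MAX" opens a sub-expression
        op = OPS[token[1:]]
        args = []
        while tokens[0] != "]":
            args.append(evaluate(tokens))
        tokens.pop(0)                      # consume the closing "]"
        return op(args)
    return int(token)                      # a plain digit operand

expr = "[MAX 4 3 [MIN 2 3 ] 1 0 [MEDIAN 1 5 8 9 2]]"
print(evaluate(expr.replace("]", " ] ").split()))   # -> 5
```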

Byte-Level Text Classification

They reason that text classification is a very common use case for transformers, “with many real-world applications such as spam, fraud, and bot detection and commercial document classification”. They use the IMDb reviews dataset and feed the models raw bytes as input to simulate longer sequences. This means the models do not form tokens based on subwords, but based on bytes (which correspond directly to characters here, since the texts are in English). I find this decision a bit poorly motivated, since subword-based models seem to be much more common. So I don’t really understand why they didn’t just choose a dataset that already contains long inputs and tackle it normally. In the field of legal NLP, for example, the input is typically quite long by default; they could have taken any dataset from, e.g., LexGLUE.
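A quick illustration (my own, not the paper’s code) of what “byte-level” input means in practice: each character of a review becomes one token, so sequences get much longer than with subword tokenization.

```python
review = "This movie was surprisingly good."

byte_tokens = list(review.encode("utf-8"))    # one integer ID (0-255) per byte
print(len(review), len(byte_tokens))          # same length here, since the text is ASCII
print(byte_tokens[:10])                       # [84, 104, 105, 115, 32, 109, 111, 118, 105, 101]
```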

Byte-Level Document Retrieval

In this task, they want to test the models’ ability to “compress long sequences into representations suitable for similarity-based matching”. They use the ACL Anthology Network dataset, where two papers are considered related if they have a citation link. Again, they use a byte/character-level setup to induce the long inputs, and again I wonder whether there are no datasets available that already naturally contain longer text inputs.
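Here is a rough sketch (mine, not the authors’ code) of what similarity-based matching over compressed representations looks like: each paper’s byte sequence is encoded independently into a fixed-size vector, and the pair is scored by how similar the two vectors are. The `encode` function is a placeholder for whichever efficient transformer is being benchmarked.

```python
import numpy as np

def encode(byte_sequence: bytes, dim: int = 128) -> np.ndarray:
    """Placeholder encoder: maps a byte sequence to a fixed-size representation."""
    rng = np.random.default_rng(len(byte_sequence))   # deterministic dummy, for illustration only
    return rng.standard_normal(dim)

paper_a = "We propose a new attention mechanism ...".encode("utf-8")
paper_b = "Our method builds on the attention mechanism of ...".encode("utf-8")

vec_a, vec_b = encode(paper_a), encode(paper_b)
cosine = float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
print(f"similarity: {cosine:.3f}")   # a classifier head would turn this into related / unrelated
```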

Image Classification on Sequences of Pixels

In this task, they do image classification by treating each pixel as a token (instead of patches of pixels, as in the Image Transformer). They argue that this requires the model to learn 2D spatial relations from a 1D sequence of tokens.
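A small sketch (my own) of the “pixels as tokens” setup: a CIFAR-10-sized image is reduced to a single grayscale channel and flattened, so the model only ever sees a 1D sequence and has to recover the 2D structure on its own.

```python
import numpy as np

image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in for a CIFAR-10 image

grayscale = image.mean(axis=-1).astype(np.uint8)    # (32, 32) single-channel image
pixel_tokens = grayscale.flatten()                  # 1D sequence of 1024 pixel "tokens"
print(pixel_tokens.shape)                           # (1024,)
```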

Pathfinder (Long-Range Spatial Dependency)

In the Pathfinder task, the model is required “to make a binary decision whether two points represented as circles are connected by a path consisting of dashes”. The models are given images of 32x32 pixels, resulting in a sequence length of 1024.

Pathfinder-X (Long-Range Spatial Dependencies with Extreme Lengths)

In the Pathfinder-X task, they simply increase the image size to 128x128, resulting in a sequence length of 16K. The goal is to see whether the task becomes harder solely because of the increased sequence length, with everything else held constant.

Conclusion

In the paper’s figure comparing the models by performance and speed, BigBird comes out as the most performant model and Performer as the fastest. Since performance is more central than speed for my research problems, I would probably go with BigBird.

All in all, I think the idea is great and there is definitely a need for this kind of benchmark. I personally would have liked them to include some more realistic and less artificial tasks.
