Revolutionizing Language: Transformers vs. RNNs — Unleashing the Power of Attention

Amber Ivanna Trujillo
6 min readOct 31, 2023

In the ever-evolving landscape of artificial intelligence, there’s a seismic shift that has reshaped the way we understand and generate language. In the not-so-distant past, Recurrent Neural Networks (RNNs) held the torch, attempting to navigate the complex terrain of language with mixed success. But then came a game-changer: Transformers. Brace yourselves for a journey through the evolution of language models, where attention becomes the linchpin to unparalleled progress.

The RNN Era: A Glimpse into the Past:

Once upon a time, RNNs were the architects of text prediction. They were formidable in their own right, but as the demands of language complexity grew, their limitations became apparent. Simple tasks, such as Natural Language Generation (NLG), often pushed RNNs to their computational boundaries. The more text they had to process, the more they demanded in terms of computing power and memory, rendering them mediocre even after extensive scaling efforts.

A fully recurrent network. Created by fdeloche at Wikipedia, licensed as CC BY-SA 4.0. No changes were made.

Where x, h, o are the input sequence, hidden state, and the output sequence, respectively. U, V, and W are the training weights.

RNNs are particularly effective in capturing short-term dependencies in sequences. However, they suffer from the vanishing gradient problem, where the influence of earlier inputs diminishes exponentially as the sequence progresses, making it difficult to capture long-term dependencies.

For Example — While RNNs show good results for shorter sequences, in longer ones (e.g. “Every week, I repetitively wash my clothes, and I do grocery shopping every day. Starting next week, I will …”) the distance between the relevant words can be too long for results to be acceptable

LSTM is a specific type of RNN architecture that addresses the vanishing gradient problem, which occurs when training deep neural networks. LSTMs leverage memory cells and gates to selectively store and retrieve information over long sequences, making them effective for capturing long-term dependencies.



Amber Ivanna Trujillo

I write about Technical stuff, interview questions, and finance Deep Learning, LLM, Startup, Influencer, Executive, Data Science Manager