Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time