Revisiting Knowledge Distillation for Autoregressive Language Models | Synapse