This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models.
When operating on byte-sized tokens, transformers scale poorly, because every token needs to "attend" to every other token.
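As a rough illustration of that scaling problem, the sketch below counts how many token pairs full self-attention has to score for the same text at byte level and at a coarser granularity. The helper names `attention_pairs` and `byte_length` are hypothetical, and whitespace-split words are only an assumed stand-in for subword tokens; real tokenizers split text differently.

```python
def attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by full self-attention over one sequence."""
    # Every token attends to every token, so the count grows as n * n.
    return seq_len * seq_len


def byte_length(text: str) -> int:
    """Sequence length when each UTF-8 byte is treated as one token."""
    return len(text.encode("utf-8"))


sample = "state space models"

# Byte-level sequence length versus an assumed word-level stand-in.
n_bytes = byte_length(sample)
n_words = len(sample.split())

print(f"bytes: {n_bytes} tokens, {attention_pairs(n_bytes)} pairs")
print(f"words: {n_words} tokens, {attention_pairs(n_words)} pairs")
```

The sequence is only a few times longer at byte level, but the number of pairs grows with the square of that factor, which is why byte-level inputs are costly for standard attention.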