Generative artificial intelligence (GenAI) and foundation models have changed machine learning by allowing systems to produce readable text, realistic images, and other multimodal content with little direct input from a user. Foundation models are large neural networks trained on very large and varied datasets, and they form the core of many current GenAI systems. Their rapid development has led to major advances in areas such as natural language processing, computer vision, multimodal learning, and robotics. Examples include GPT, LLaMA, and diffusion-based architectures, which are often used for image generation. Systems such as Stable Diffusion illustrate this shift, showing how AI can interpret information, draw basic inferences, and produce new outputs from more than one type of data. This review surveys common foundation model architectures and examines their capabilities in generative tasks. It covers Transformer, diffusion, and multimodal architectures, focusing on methods that support scaling and transfer across domains. It also covers key approaches to pretraining and fine-tuning, including self-supervised learning, instruction tuning, and parameter-efficient adaptation, which underpin these systems’ ability to generalize across tasks. Beyond the technical details, the review discusses how GenAI is being used for text generation, image synthesis, robotics, and biomedical research. It also notes continuing challenges, such as the high computing and energy demands of large models, ethical concerns about data bias and misinformation, and worries about privacy, reliability, and responsible use of AI in real settings. The review brings together ideas about model design, training methods, and social implications to point future research toward GenAI systems that are efficient, easy to interpret, and reliable, while supporting scientific progress and ethical responsibility.
Elhanashi et al. studied this question.