Soket AI Labs has unveiled Pragna-1B, an open-source multilingual language model built for India's linguistic diversity. Developed in collaboration with Google Cloud, the model aims to narrow the gap in language-model support for Indic languages and covers Hindi, Gujarati, Bangla, and English.
Overview of Pragna-1B
Pragna-1B is a decoder-only Transformer model with 1.25 billion parameters and a context length of 2048 tokens. It was developed over six months and trained on 150 billion tokens, using extensive computational resources that included 8000 GPU hours on NVIDIA A100 systems.
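To put these figures in perspective, the short Python sketch below derives a few back-of-the-envelope ratios from the numbers quoted above. It assumes the 8000 GPU hours cover all 150 billion tokens, which the announcement does not state explicitly, so the results are rough averages and not measured throughput. The function and constant names are illustrative only.

```python
# Figures quoted in the article.
TOTAL_TOKENS = 150e9     # tokens processed during training
GPU_HOURS = 8000         # NVIDIA A100 GPU hours
PARAMETERS = 1.25e9      # model parameters
CONTEXT_LENGTH = 2048    # tokens per full-length sequence


def tokens_per_gpu_hour(total_tokens: float, gpu_hours: float) -> float:
    """Average tokens processed per GPU hour (assumes all hours went to these tokens)."""
    return total_tokens / gpu_hours


def tokens_per_gpu_second(total_tokens: float, gpu_hours: float) -> float:
    """Average tokens processed per second on a single GPU."""
    return total_tokens / (gpu_hours * 3600)


def tokens_per_parameter(total_tokens: float, parameters: float) -> float:
    """Ratio of training tokens to model parameters."""
    return total_tokens / parameters


def full_context_sequences(total_tokens: float, context_length: int) -> float:
    """Number of full-length sequences the token budget corresponds to."""
    return total_tokens / context_length


print(f"tokens per GPU hour:    {tokens_per_gpu_hour(TOTAL_TOKENS, GPU_HOURS):,.0f}")
print(f"tokens per GPU second:  {tokens_per_gpu_second(TOTAL_TOKENS, GPU_HOURS):,.0f}")
print(f"tokens per parameter:   {tokens_per_parameter(TOTAL_TOKENS, PARAMETERS):,.0f}")
print(f"2048-token sequences:   {full_context_sequences(TOTAL_TOKENS, CONTEXT_LENGTH):,.0f}")
```

Under that assumption, the quoted budget works out to about 18.75 million tokens per GPU hour, roughly 5,200 tokens per second per GPU, and 120 training tokens per parameter.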
Training and Development
The development of Pragna-1B involved several key steps:
- Embedding Alignment: Initially, only the embedding layer and lm_head were trained, with all other tensors kept frozen. A parallel-sentence dataset from Bhasha-wiki, which pairs sentences in six Indian languages with their English counterparts, was used for this alignment.
- Continual Pretraining: All 1.25 billion parameters were then unfrozen for further training. Because of computational constraints, this stage focused on Hindi, Bangla, and Gujarati, with data in these languages sampled at high probability. The model processed approximately 150 billion tokens over 8000 GPU hours.
- Instruction Fine-Tuning: The model underwent supervised fine-tuning across various tasks such as conversation, question-answering, summarization, and paraphrasing. This step incorporated over 13 million instances of instruction-response data from multiple sources including Bhasha-SFT, Indic-align, and Samvaad.
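The difference between the first two stages comes down to which tensors receive gradient updates. The sketch below illustrates that selection in plain Python. It is not Soket AI Labs' training code: the stage names, the parameter names in `EXAMPLE_PARAMS`, and the name fragments used to identify the embedding and output head are all hypothetical and chosen only for illustration.

```python
# Hypothetical name fragments identifying the embedding matrix and the output head.
ALIGNMENT_TRAINABLE = ("embed_tokens", "lm_head")

# Hypothetical parameter names, loosely styled after common decoder-only models.
EXAMPLE_PARAMS = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]


def is_trainable(param_name: str, stage: str) -> bool:
    """Decide whether a parameter is updated in the given training stage."""
    if stage == "embedding_alignment":
        # Only the embedding and lm_head are updated; everything else stays frozen.
        return any(fragment in param_name for fragment in ALIGNMENT_TRAINABLE)
    if stage == "continual_pretraining":
        # All parameters are enabled for training.
        return True
    raise ValueError(f"unknown stage: {stage!r}")


def trainable_names(param_names: list[str], stage: str) -> list[str]:
    """Return the parameter names that would be updated in this stage."""
    return [name for name in param_names if is_trainable(name, stage)]


for stage in ("embedding_alignment", "continual_pretraining"):
    print(stage, "->", trainable_names(EXAMPLE_PARAMS, stage))
```

In a PyTorch training script, the same decision would typically be applied by iterating over `model.named_parameters()` and setting each parameter's `requires_grad` flag accordingly.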
Ethics and Safety Alignment
Soket AI Labs places significant emphasis on ethical AI. The fine-tuning data includes datasets specifically designed to prevent the model from generating unethical or harmful content. This focus on safety and ethics is intended to keep Pragna-1B aligned with human values.
Community Engagement and Future Plans
Soket AI Labs plans to release Pragna-1B under an open-source license, inviting feedback from the community to refine and enhance the model further. An initial research preview of the instruction-tuned model is available via a chat interface, although it is not recommended for production use due to potential factual inaccuracies.
Significance and Potential
Pragna-1B represents a significant advancement in the field of AI language models for Indic languages. By focusing on linguistic inclusivity and ethical AI practices, Soket AI Labs aims to contribute to the broader AI community and enhance user engagement across diverse linguistic landscapes.
The collaboration with Google Cloud underscores the importance of leveraging advanced cloud infrastructure to develop and deploy AI models efficiently. As AI technology continues to evolve, models like Pragna-1B are poised to play a crucial role in making AI accessible and useful for a wider audience.