AMD and Rhymes AI: Accelerating Multimodal Innovation
October 10, 2024
10 min read
Rhymes AI Team


Aria: First Open Multimodal Native MoE Model




Rhymes AI is proud to introduce Aria, the world’s first open-source, multimodal native Mixture-of-Experts (MoE) model.

In short, Aria features:

  • Multimodal native understanding:

    • State-of-the-art performance on a wide range of multimodal and language tasks
    • Pre-trained from scratch on a mixture of multimodal and language data
  • Lightweight and fast:

    • Fine-grained mixture-of-experts model with 3.9B activated parameters per token
    • Efficient and informative visual encoding of variable image sizes and aspect ratios
  • Long context window:

    • Long multimodal context window of 64K tokens, captioning a 256-frame video in 10 seconds
  • Open:

    • Open model weights and a codebase that supports inference and fine-tuning (see Developing with Aria below)


Multimodal Native Performance


Figure 1. Aria is a multimodal native model that excels at understanding text, vision, and code.

Aria processes text, images, video, and code all at once, without needing separate setups for each type, demonstrating the advantages of a multimodal native model.

We provide a quantifiable definition for the term multimodal native:

A multimodal native model refers to a single model with strong understanding capabilities across multiple input modalities (e.g., text, code, image, video) that matches or exceeds modality-specialized models of similar capacity.

We compared Aria against the best open and closed multimodal native models across established benchmarks, highlighting the following key observations:

  • Best-in-Class Performance: Aria is the leading multimodal native model, demonstrating clear advantages over Pixtral-12B and Llama3.2-11B across a range of multimodal, language, and coding tasks.
  • Competitive Against Proprietary Models: Aria performs competitively against proprietary models like GPT-4o and Gemini-1.5 on multimodal tasks, including document understanding, chart reading, scene text recognition, and video understanding.
  • Parameter Efficiency: Aria is the most parameter-efficient open model. Thanks to the MoE framework, Aria activates only 3.9 billion parameters, compared to the full activation in models like Pixtral-12B and Llama3.2-11B.

Figure 2. Aria shows best-in-class benchmark performance on multimodal, language, and coding tasks.


Long Multimodal Input Understanding


Multimodal data is often complex, involving long sequences that combine visuals and text, like videos with subtitles or long documents. For a model to be effective in real-world applications, it must be capable of understanding and processing such data efficiently.

Aria excels in this area, demonstrating superior long multimodal input understanding. It outperforms larger open models, proving its efficiency and effectiveness despite its size. When compared to proprietary models, Aria surpasses GPT-4o mini in long video understanding and outperforms Gemini-1.5-Flash in long document understanding. This makes Aria a preferred choice for processing extensive multimodal data in a compute-and-time-efficient manner, delivering faster and more accurate results in real-world scenarios.


Figure 3. Aria excels in long multimodal input understanding, such as long video understanding.


Instruction Following


Aria is highly effective at understanding and following instructions on both multimodal and language inputs, performing better than top open-source models on both MIA-Bench and MT-Bench.

Figure 4. Aria is highly effective at instruction following on multimodal and language inputs.


Multimodal Native Training


Figure 5. Aria's 4-stage training pipeline for multimodal native understanding.

Aria is pre-trained from scratch using a 4-stage training pipeline, ensuring that the model progressively learns new capabilities while retaining previously acquired knowledge. The four stages are outlined below, followed by an illustrative summary of their token budgets.

1. Language Pre-Training:
  • Pre-trains the MoE decoder using a large, curated dataset of language data, including programming code.
  • Utilizes a next-token prediction loss to enable the MoE to learn general knowledge about the world.
  • Consumes 6.4T text tokens.
2. Multimodal Pre-Training:
  • Pre-trains both the MoE decoder and the visual encoder with a mixture of language and multimodal data.
  • Equips the model with broad multimodal understanding capabilities while maintaining or enhancing its language comprehension.
  • The dataset includes:
    • 1T high-quality language tokens.
    • 400B multimodal tokens from various sources, such as:
      • Interleaved image-text sequences from Common Crawl.
      • Synthetic image captions.
      • Document transcriptions and question-answering pairs.
      • Synthetic video captions and question-answering pairs.
  • A rigorous data curation and filtering procedure ensures data quality and safety.
3. Multimodal Long-Context Pre-Training:
  • Extends the model's context window to 64K tokens.
  • Processes 12B text tokens and 21B multimodal tokens, with 69% of the 33B tokens being long sequences.
  • Anneals the learning rate to help the model converge.
4. Multimodal Post-Training:
  • Improves the model's question-answering and instruction-following abilities.
  • Consumes 20B tokens of high-quality data.
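
To make the stage-by-stage budgets easier to compare, here is a compact, illustrative summary in Python. The structure and field names are hypothetical bookkeeping, not Rhymes AI's actual training configuration; only the token counts, context length, and long-sequence fraction are taken from the list above.

  import math  # not strictly needed; kept minimal on purpose

  # Illustrative summary of the 4-stage pipeline described above.
  # Field names and structure are assumptions for illustration only.
  TRAINING_STAGES = [
      {
          "name": "language_pretraining",
          "trains": ["moe_decoder"],
          "tokens": {"text": 6.4e12},                        # 6.4T text tokens
      },
      {
          "name": "multimodal_pretraining",
          "trains": ["moe_decoder", "vision_encoder"],
          "tokens": {"text": 1.0e12, "multimodal": 4.0e11},  # 1T + 400B tokens
      },
      {
          "name": "multimodal_long_context_pretraining",
          "tokens": {"text": 1.2e10, "multimodal": 2.1e10},  # 12B + 21B tokens
          "context_window": 65_536,                          # extended to 64K
          "long_sequence_fraction": 0.69,
      },
      {
          "name": "multimodal_post_training",
          "tokens": {"total": 2.0e10},                       # 20B high-quality tokens
      },
  ]

  for stage in TRAINING_STAGES:
      total = sum(stage["tokens"].values())
      print(f"{stage['name']}: ~{total / 1e9:,.0f}B tokens")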



Qualitative Examples


Multimodal Native Reasoning with Vision, Language, Coding Capabilities


Prompt:
Organize the highest temperature, lowest temperature, and humidity for all even dates into a table.

Write a python code to draw a line chart for the temperatures, and a bar chart for the humidity. Use red and green colors for the temperatures, and use blue color for the humidity.

(Input image source: https://www.timeanddate.com/weather/singapore/singapore/hourly)

Aria response:
Sure! The table of temperature, humidity for even dates is as below:

| Date   | Highest Temperature | Lowest Temperature | Humidity |
|--------|---------------------|--------------------|----------|
| 20 Sep | 33 °C               | 26 °C              | 62%      |
| 22 Sep | 33 °C               | 26 °C              | 63%      |
| 24 Sep | 32 °C               | 26 °C              | 70%      |
| 26 Sep | 29 °C               | 24 °C              | 76%      |
| 28 Sep | 27 °C               | 26 °C              | 69%      |
| 30 Sep | 27 °C               | 26 °C              | 66%      |

The Python code to draw the line chart for temperatures and the bar chart for humidity is as follows:


  import matplotlib.pyplot as plt

  # Data for even dates
  days = ['20 Sep', '22 Sep', '24 Sep', '26 Sep', '28 Sep', '30 Sep']
  high_temps = [33, 33, 32, 29, 27, 27]
  low_temps = [26, 26, 26, 24, 26, 26]
  humidity = [62, 63, 70, 76, 69, 66]

  # Line chart for temperatures
  plt.figure(figsize=(14, 6))

  plt.subplot(1, 2, 1)
  plt.plot(days, high_temps, 'r-', label='Highest Temperature', marker='o')
  plt.plot(days, low_temps, 'g-', label='Lowest Temperature', marker='o')
  plt.xlabel('Days')
  plt.ylabel('Temperature (°C)')
  plt.title('Temperature Variation')
  plt.legend()
  plt.grid(True)

  # Bar chart for humidity
  plt.subplot(1, 2, 2)
  plt.bar(days, humidity, color='blue', alpha=0.7)
  plt.xlabel('Days')
  plt.ylabel('Humidity (%)')
  plt.title('Humidity')
  plt.ylim(0, 100)
  plt.grid(axis='y')

  plt.tight_layout()
  plt.show()
  

Using the provided code snippet from Aria, we can generate the following visualizations:

(Resulting charts: a line chart of the highest and lowest temperatures and a bar chart of the humidity for the even dates.)



Multimodal Native Reasoning - Financial Report Analysis


Prompt:
How does the company's profit margin change from year 2020 to 2021? What factors cause such changes? Answer step by step.


(source: https://www.financestrategists.com/wealth-management/financial-statements/)

Aria response:
To determine how the company's profit margin changed from 2020 to 2021, we need to follow these steps:

  1. Calculate the profit margins for both years:

    • 2020:
      • Net income: $57,411 million
      • Total net sales: $274,515 million
      • Profit margin = (Net income / Total net sales) * 100 = (57,411 / 274,515) * 100 ≈ 20.91%
    • 2021:
      • Net income: $94,680 million
      • Total net sales: $365,817 million
      • Profit margin = (Net income / Total net sales) * 100 = (94,680 / 365,817) * 100 ≈ 25.88%
  2. Analyze the factors causing the change:

    • Increase in Net Income: The net income increased from $57,411 million in 2020 to $94,680 million in 2021, showing a significant rise of $37,269 million.
    • Sales Growth: Total net sales also increased from $274,515 million in 2020 to $365,817 million in 2021, indicating a strong growth in revenue.
    • Cost Management: The cost of sales increased from $151,286 million in 2020 to $192,266 million in 2021, but the percentage increase in sales outpaced the increase in costs, leading to a higher profit margin.
    • Operating Expenses: Operating expenses grew from $38,668 million in 2020 to $43,887 million in 2021, but the rise in sales was more substantial, positively impacting the profit margin.

Therefore, the company's profit margin improved from approximately 20.91% in 2020 to approximately 25.88% in 2021. This improvement is primarily due to a substantial increase in net income and effective cost management relative to the growth in total net sales.



Long Multimodal Context Window


Prompt:
Based on the five species in the images, draw a food chain. Explain the role of each species in the food chain.

(Input: five images, one per species listed in the response below.)

Aria response:

  1. Mushrooms (Image 1): Decomposers, breaking down dead organic matter and recycling nutrients.
  2. Grass (Image 4): Primary producers, converting sunlight into energy via photosynthesis.
  3. Grasshopper (Image 2): Primary consumers, herbivores feeding on plants.
  4. Mouse (Image 3): Secondary consumers, herbivores that eat plants and possibly small insects.
  5. Hawk (Image 5): Tertiary consumers, predators that eat mice and other small animals.

Food Chain:
Grass → Grasshopper → Mouse → Hawk → Mushrooms

Explanation:

  • Grass serves as the base, providing energy to grasshoppers.
  • Grasshoppers eat the grass and are then consumed by mice.
  • Mice are preyed upon by hawks.
  • Hawks are apex predators, feeding on mice.
  • Mushrooms decompose dead matter, recycling nutrients back into the ecosystem.


Long Multimodal Input - Paper Reading


Prompt:
Summarize the key contribution of the paper, and explain how it uses various figures and tables to showcase such contribution.


Aria response:

The paper introduces LONGVIDEOBENCH, a benchmark for evaluating Large Multimodal Models (LMMs) in long-context video-language understanding. It addresses the challenge of processing longer video inputs by proposing a question-answering benchmark that tests the models' ability to understand and reason about video contexts interleaved with up to an hour-long video. This benchmark includes 3,763 videos with diverse themes and 6,678 human-annotated multiple-choice questions across 17 categories.

The key contribution is the comprehensive evaluation of LMMs' performance on long-form video understanding, highlighting the limitations of existing models in handling extended inputs. The paper demonstrates that while proprietary models like GPT-4 and Gemini-1.5-Pro show improvements with longer inputs, open-source models lag significantly behind.

Figures and tables are used to illustrate these points:

  • Figure 1: Demonstrates the performance of different models on the benchmark, showing how accuracy varies with the number of frames.
  • Figure 2: Provides examples of the 17 categories of referring reasoning questions.
  • Table 1: Lists the benchmark and previous benchmarks for comparison.
  • Table 2: Details the 17 categories of referring reasoning questions.
  • Table 3: Shows statistics of videos in LONGVIDEOBENCH by category groups and duration.
  • Table 5: Validation set results categorized by duration groups.
  • Table 7: Test set leaderboard of LMMs by duration groups and question categories.

These visual aids help in understanding the performance metrics, the diversity of the dataset, and the specific challenges posed by longer videos and complex questions, thus emphasizing the need for improved models in long-form video understanding.



Long Multimodal Input - Split Video and Write Scene Descriptions


Prompt:
Please split this video into scenes, providing start time, end time, a title and detailed descriptions for each scene.

(source: https://www.youtube.com/watch?v=Gh22bt1GYoQ)

Aria response:
Certainly! Here are the scenes from the video, including their start and end times, titles, and detailed descriptions:

Scene 1: Introduction
  • Start Time: 00:01
  • End Time: 00:13
  • Description: The video opens in a modern kitchen where an individual in a black shirt and apron stands behind a counter. The kitchen is well-organized with various spices and knives visible. The person appears to be preparing to demonstrate a cooking process.

Scene 2: Cooking Demonstration
  • Start Time: 00:16
  • End Time: 00:29
  • Description: The focus shifts to a close-up of a plate with sliced meat and a yellow sauce. A spoon is used to serve the meat, emphasizing the cooking process.

Scene 3: Potato Preparation
  • Start Time: 00:32
  • End Time: 01:18
  • Description: The scene returns to the kitchen where the individual is holding a cookbook and standing near a pot on a stove. The person is preparing to cook potatoes, with various ingredients and utensils visible.

Scene 4: Potato Mashing
  • Start Time: 01:22
  • End Time: 03:49
  • Description: The individual is now mashing potatoes in a pot, using a masher. The process is shown in detail, with the person adding butter and milk to the mashed potatoes.

Scene 5: Potato Transfer
  • Start Time: 03:53
  • End Time: 04:19
  • Description: The mashed potatoes are transferred to a metal bowl. The individual uses a spatula to spread the potatoes evenly in the bowl.

Scene 6: Therapist Matching Quiz
  • Start Time: 04:22
  • End Time: 05:08
  • Description: The scene shifts to a webpage for BetterHelp, featuring a therapist matching quiz. The page highlights the benefits of online therapy and includes testimonials.

Scene 7: Cooking Continuation
  • Start Time: 05:11
  • End Time: 06:20
  • Description: The video returns to the kitchen where the individual is preparing a piece of meat. The person is shown cutting and seasoning the meat, with various kitchen tools and ingredients visible.

Scene 8: Potato Wrapping
  • Start Time: 06:23
  • End Time: 07:59
  • Description: The individual is now wrapping the meat in a green leaf and tying it with string. The process is shown in detail, with the person carefully wrapping the meat.

Scene 9: Cooking Meat
  • Start Time: 08:02
  • End Time: 09:50
  • Description: The wrapped meat is placed in a pot with a dark liquid. The person adds various ingredients to the pot, including a seasoning mix and a liquid from a bottle.

Scene 10: Final Preparation
  • Start Time: 09:53
  • End Time: 11:57
  • Description: The individual is seen preparing a large piece of meat on a tray. The person seasons the meat and places it in a frying pan. The scene also includes a brief outdoor workout session.

Scene 11: Plating the Dish
  • Start Time: 11:58
  • End Time: 13:37
  • Description: The final dish is plated, with the meat placed on a bed of mashed potatoes. The individual is joined by another person, and they both taste the dish.

Scene 12: Conclusion
  • Start Time: 13:38
  • End Time: 13:56
  • Description: The video concludes with a promotional image for a cooking course. The individual is shown in a chef's outfit, with the course title and author's name displayed. The video ends with a call to action for viewers to like and subscribe.

The video has a total of 12 scenes, each with a specific focus and detailed description.



Model Architecture


Aria comprises a vision encoder and a mixture-of-experts decoder. The vision encoder produces visual tokens for images/videos in their native aspect ratio. It operates in three resolution modes: medium, high, and ultra-high. A medium-resolution image is represented using 128 visual tokens; a high-resolution image is represented using 256 visual tokens; an ultra-high resolution image is dynamically decomposed into multiple high-resolution sub-images.

Figure 6. Aria's vision encoder operating in three resolution modes: medium, high, and ultra-high, with varying token representations.
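
As a rough illustration of these token budgets, the hypothetical helper below estimates how many visual tokens a single image contributes in each mode. The tiling rule for ultra-high resolution (and the 980-pixel tile size) is an assumption made for illustration; it is not Aria's exact decomposition logic.

  import math

  # Visual-token budgets per resolution mode, as described above.
  TOKENS_PER_MEDIUM_IMAGE = 128
  TOKENS_PER_HIGH_IMAGE = 256

  def visual_token_count(width: int, height: int, mode: str,
                         tile_size: int = 980) -> int:
      """Estimate how many visual tokens one image contributes.

      The ultra-high branch assumes the image is tiled into high-resolution
      sub-images of roughly tile_size x tile_size pixels; the exact
      decomposition Aria uses is not specified here.
      """
      if mode == "medium":
          return TOKENS_PER_MEDIUM_IMAGE
      if mode == "high":
          return TOKENS_PER_HIGH_IMAGE
      if mode == "ultra":
          tiles = math.ceil(width / tile_size) * math.ceil(height / tile_size)
          return tiles * TOKENS_PER_HIGH_IMAGE
      raise ValueError(f"unknown mode: {mode}")

  # Example: a 256-frame video encoded at medium resolution stays well within
  # the 64K-token context window (256 * 128 = 32,768 visual tokens).
  print(visual_token_count(3840, 2160, "ultra"))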

The MoE decoder is multimodal native and makes predictions conditioned on both language and visual input tokens. Specifically, Aria has 66 experts in each MoE layer: 2 of them are shared across all inputs to capture common knowledge, while a router module activates 6 additional experts for each token.

Figure 7. Aria's multimodal native MoE decoder.
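
The sketch below illustrates this routing pattern: 2 shared experts that see every token, plus a router that picks 6 of the remaining 64 experts per token. The hidden sizes, expert MLP shape, and the class itself are assumptions for illustration only, not Aria's actual implementation.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class SharedPlusRoutedMoELayer(nn.Module):
      """Illustrative MoE layer: 2 shared experts are applied to every token,
      and a router activates 6 of the remaining 64 experts per token.
      All sizes are assumptions, not Aria's real configuration."""

      def __init__(self, d_model=2560, d_ff=1664,
                   n_experts=66, n_shared=2, top_k=6):
          super().__init__()
          self.n_shared, self.top_k = n_shared, top_k
          self.experts = nn.ModuleList(
              nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(),
                            nn.Linear(d_ff, d_model))
              for _ in range(n_experts)
          )
          # The router scores only the non-shared experts.
          self.router = nn.Linear(d_model, n_experts - n_shared)

      def forward(self, x):                          # x: (num_tokens, d_model)
          # Shared experts process every token unconditionally.
          out = sum(self.experts[i](x) for i in range(self.n_shared))
          # Each token is routed to its top-k of the remaining experts.
          gates = F.softmax(self.router(x), dim=-1)          # (num_tokens, 64)
          topk_w, topk_idx = gates.topk(self.top_k, dim=-1)  # (num_tokens, 6)
          for slot in range(self.top_k):
              idx, w = topk_idx[:, slot], topk_w[:, slot:slot + 1]
              for e in idx.unique().tolist():
                  mask = idx == e
                  out[mask] += w[mask] * self.experts[self.n_shared + e](x[mask])
          return out

  tokens = torch.randn(4, 2560)
  print(SharedPlusRoutedMoELayer()(tokens).shape)    # torch.Size([4, 2560])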


Developing with Aria


Aria is designed to be developer-friendly. To support development and collaboration, Rhymes AI provides a codebase that makes it straightforward to adopt Aria in downstream applications.

The codebase features:

  • Fast and easy inference with Transformers or vLLM (see the sketch after this list)
  • Cookbooks and best practices for using Aria
  • Fine-tuning Aria on various dataset formats, with as little as a single GPU
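
For instance, a minimal image-plus-text inference call through Transformers might look like the sketch below. The Hub identifier, chat-template message format, and pixel-value handling shown here are assumptions based on common Transformers usage; consult the Aria codebase and model card for the exact API.

  # Minimal sketch of image + text inference via Hugging Face Transformers.
  # The model ID, prompt format, and key names are assumptions; check the
  # official Aria codebase / model card for the exact usage.
  import requests
  import torch
  from PIL import Image
  from transformers import AutoModelForCausalLM, AutoProcessor

  model_id = "rhymes-ai/Aria"  # assumed Hub identifier
  processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
  model = AutoModelForCausalLM.from_pretrained(
      model_id,
      torch_dtype=torch.bfloat16,
      device_map="auto",
      trust_remote_code=True,
  )

  image = Image.open(requests.get("https://example.com/chart.png",  # placeholder URL
                                  stream=True).raw)
  messages = [{"role": "user", "content": [
      {"type": "image"},
      {"type": "text", "text": "Summarize this chart in one sentence."},
  ]}]
  prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

  inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
  inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)  # assumed key name

  with torch.inference_mode():
      output = model.generate(**inputs, max_new_tokens=256)
  # Some processors expose decoding via processor.tokenizer instead.
  print(processor.decode(output[0], skip_special_tokens=True))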


Looking Ahead


With the launch of Aria, we're aiming to redefine how AI integrates into our lives. Our focus is on creating tools that genuinely benefit people, expanding what’s possible in both technology and human connection.

We are excited to collaborate with researchers, developers, and innovative organizations to explore new frontiers in AI. Together, we can build solutions that are not only cutting-edge but also impactful and meaningful for society.

We believe the future of AI is multimodal, efficient, and seamlessly integrated into everyday experiences. Aria is just the beginning. Let’s shape the future of AI together.