Part 2 – Enterprise AI in Media: Use Cases and the Challenge of Context in the Agentic Future https://imaginario.ai/wide-lens/artificial-intelligence/part-2-enterprise-ai-in-media-use-cases-and-the-challenge-of-context-in-the-agentic-future/ Mon, 15 Dec 2025 16:21:56 +0000 https://imaginario.ai/?p=2087

‘Constrained agency’ is becoming the rule in Enterprise: AI handles the heavy lifting, while humans make the decisions. Context is the next challenge.

This is the second post in a three-part series on the key insights and findings I took away from the Digital Production Partnership (DPP) Leaders Briefing 2025 in London. The DPP is one of the media industry's most important trade associations, bringing together senior executives from major media organizations. More than 1,000 attendees came to the conference, held on November 18 and 19.

The transformation from AI demos to production systems is underway across media and entertainment. But the more interesting story isn’t that AI has arrived, it’s where it’s actually working, where it’s struggling, and what infrastructure changes are required to unlock the next wave of value.

Where AI Is Delivering Real Value in the Media Enterprise Space

Use Case 1: Ingest and Metadata. The Highest-Leverage Point for Automation

High-quality and multimodal metadata at point of ingestion lies at the heart of building intelligent and robust AI systems further downstream in the media supply chain.

The standout finding from recent industry surveys such as the DPP’s CEO and CTO surveys is that metadata generation and enrichment, particularly speech-to-text and automated speech recognition, has crossed the adoption chasm. More than 80% of organizations are either encouraging or actively implementing these capabilities.

The logic is sound. The oldest rule in data engineering applies: garbage in, garbage out. If your metadata is incomplete, inconsistent, or wrong, everything downstream suffers. Search breaks. Rights management fails. Compliance becomes manual. Smart organizations have realized that the front door of the supply chain is the highest-leverage point for automation.

The DPP’s Media AI Radar 2026, based on industry interviews, forecasts higher adoption of AI in ideation, planning, analytics, commissioning, compliance, video marketing, advertising, and ingest/logging.

The more sophisticated implementations involve what’s being called “agentic orchestration”: AI systems that don’t just transcribe content but actively monitor ingestion workflows for anomalies.

For example, a file arrives with metadata that doesn’t match expected formats or the level of depth and accuracy required. Maybe the localization markers are wrong, or the language code contradicts the filename. Traditional workflows would let that error flow downstream until a human spotted it.

With agentic orchestration applied in Q/C, the system flags the mismatch immediately, provides context, and prompts investigation. A person remains in charge, operating with better information and recommendations earlier in the process.
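
To make this concrete, here is a minimal sketch of the kind of rule an agentic QC step might apply at ingest (the field names and record structure are hypothetical, not any particular MAM's schema): compare the declared metadata against what the asset itself reports, and route mismatches to a human reviewer instead of letting them flow downstream.

```python
from dataclasses import dataclass, field

@dataclass
class IngestRecord:
    filename: str
    declared_language: str   # language code from the sidecar metadata
    detected_language: str   # language inferred from the audio track
    declared_format: str
    detected_format: str     # format reported by probing the file itself
    issues: list = field(default_factory=list)

def qc_check(record: IngestRecord) -> IngestRecord:
    """Flag metadata that contradicts what the asset itself reports.

    The agent never fixes anything silently: it attaches context and
    leaves the decision to a human operator."""
    if record.declared_language.lower() != record.detected_language.lower():
        record.issues.append(
            f"Language code '{record.declared_language}' contradicts detected "
            f"audio language '{record.detected_language}' in {record.filename}"
        )
    if record.declared_format != record.detected_format:
        record.issues.append(
            f"Declared format '{record.declared_format}' does not match "
            f"probed format '{record.detected_format}'"
        )
    return record

# Anything with issues goes to a review queue instead of the archive.
rec = qc_check(IngestRecord("ep01_FRA.mxf", "fr", "de", "XDCAM HD422", "XDCAM HD422"))
print("human_review" if rec.issues else "auto_archive", rec.issues)
```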

However, as we will see later in this post, agentic AI doesn’t just need highly accurate metadata but also ways to access content and data silos in a more fluent, robust, and secure way. This remains a challenge for large organizations.

Use Case 2: Localization and Accessibility. AI Drafts, Humans Approve

AI is heavily disrupting the localization industry. Yet, the media industry continues to maintain a high bar for quality dubbing and audio descriptions. Humans and AI together produce the best output.

Subtitling, captions, and synthetic dubbing are seeing strong adoption (especially the first two), but with an important caveat: only 14% of teams find auto-generated captions “fully usable” without human review, and 75% cite timing and sync issues as their primary complaint.

Bad subtitles aren’t just annoying, they’re trust-destroying. The pattern emerging is AI-assisted drafting plus mandatory human review. Speech-to-text models generate initial captions. LLMs detect sync problems and propose corrections. Linguistic experts review the final files. The AI handles grunt work; humans ensure quality.

Use Case 3: Marketing and Promotion – Dramatic Time Savings

Beyond the hype. Automated and semi-automated video repurposing is proving to be a growing revenue pipeline for companies with long-form content, such as those in Sports, and a key way to monetize deep high-quality catalogues.

This is where AI delivers some of its most compelling efficiency gains, and it goes well beyond flashy generative tools like Nano Banana. A single piece of premium long-form content (a sports match, a TV episode) might need dozens of promotional assets tailored for different platforms, audiences, and campaigns.

These are AI tools that can analyze content, identify emotional beats, extract compelling sequences, and generate draft cuts in minutes. What used to take days happens in hours or minutes. This is one of the use cases driving the highest demand for Imaginario AI, and this cuts across media, entertainment, sports, news and even SMB content.

In the case of Imaginario AI, we have seen our Enterprise clients save between 50% and 75% of the time spent searching deep catalogues, navigating dailies, and generating social media compilations or single cuts. This is quantifiable ROI.

Use Case 4: Contextual Ad Discovery and Insertions. The Next Frontier in Programmatic Advertising

AI contextual ad discovery understands the nuance of content to place ads in the perfect environment. This enables highly personalized creative that aligns with a user's immediate mindset and mood, boosting engagement without relying on personal data.

AI tools can analyze narrative structure, detect scene transitions, and suggest non-intrusive ad slots based on emotional pacing. They can also explain their reasoning, which is critical for creative buy-in. The tools that work don’t make autonomous placement decisions. They suggest options, show reasoning, and make overrides easy.
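
A toy sketch of that pattern (the scores and thresholds are invented for illustration): given scene transitions annotated with an emotional-intensity score, suggest the calmest boundaries as candidate ad slots, attach a human-readable reason, and leave the final placement to a person.

```python
def suggest_ad_slots(transitions, max_intensity=0.4, max_slots=3):
    """transitions: list of {"time": seconds, "intensity": 0..1} scene boundaries.
    Returns suggestions with reasoning, never an autonomous placement."""
    calm = sorted(
        (t for t in transitions if t["intensity"] <= max_intensity),
        key=lambda t: t["intensity"],
    )
    return [
        {
            "time": t["time"],
            "reason": (
                f"Scene boundary at {t['time']}s with low emotional intensity "
                f"({t['intensity']:.2f}); unlikely to interrupt a narrative beat."
            ),
        }
        for t in calm[:max_slots]
    ]

transitions = [
    {"time": 312, "intensity": 0.15},
    {"time": 745, "intensity": 0.82},  # climax: never suggested
    {"time": 1104, "intensity": 0.33},
]
for slot in suggest_ad_slots(transitions):
    print(slot["time"], "-", slot["reason"])
```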

The Maturity Gap: Where We Actually Are

For all this progress, humility and baby steps are still in order. When surveyed by the DPP about media supply chain maturity, 61% of respondents said “developing” and 36% said “maturing.” Only 3% called it “advanced.” Nobody, not a single respondent, said “leading”.

The vast majority of media companies are still developing their cloud, interoperability and AI capabilities. They expect to reach a maturity stage in three years. Source: DPP CTO Survey 2025.

Projected three years ahead, roughly half expect to be maturing, a third hope for advanced, and just 3% think the industry will be truly leading.

We’re in the early innings. The teams deploying AI today are pioneers, not late adopters. Most organizations are still figuring out use cases, governance, training, and change management. The technology is ahead of organizational readiness.

Co-Pilot, Not Autopilot: The Design Pattern That Works

The highest gains are seen when humans and AI do what they do best: AI handles data processing, automation of repetitive tasks, and generating drafts. Humans use critical judgement, context, creativity, and empathy.

Across trust, integration, governance, security, and perceived technology maturity, concern levels spike whenever AI autonomy increases. People want smart assistants. They do not want self-driving compliance, QC, or editorial decisions.

The pattern that’s working: constrained agency. AI with enough freedom to do meaningful work, but within boundaries that remain legible and controllable by humans. The level of acceptable autonomy varies by use case. Legal compliance has a higher accuracy bar than generating marketing highlights, but the principle holds.

This isn’t because the technology can’t handle more autonomy. It’s because organizations aren’t ready to trust it, and more importantly, because accountability matters. When something goes wrong, someone has to answer for it. That someone needs to be a human.

As one technology vendor put it perfectly during the DPP Leadership Summit last November in London: “AI handles the heavy lifting; humans handle the decisions.”

The Enterprise Challenge: A Fragmented Context

AI implementations often stall because essential enterprise context is locked in fragmented systems and unstructured media formats like video and audio. Making diverse data accessible is critical infrastructure work required for scalable AI success.

For AI agents to be effective in the enterprise, they need enterprise context. That context sits in contracts, financial documents, research, marketing assets, meeting notes, conversations, and every other piece of information across the organization. By volume, most of this data is unstructured and remains both on-premises and in the cloud. There is no one-size-fits-all solution, as each organization is at a different stage of digitalization and cloud development.

Your customer information might live in HubSpot. Projects are in Monday. Engineering issues sit in Jira. Financials are in Xero or SAP. Your product catalog exists in yet another database and in MAM systems. Branding and marketing assets are scattered across DAMs and creative toolkits like Adobe or Avid.

Enterprise context is highly fragmented and requires bespoke integrations, taxonomies, and adapted schemas that ensure data consistency and make it easier for different systems to understand and exchange data. Without this, even the smartest AI still won't know your organization or your workflows, and it will return inaccurate information and potential hallucinations.

This fragmentation is why so many AI implementations stall after promising pilots. The AI works brilliantly on a single data source but struggles when it needs to compose context from multiple systems to answer real business questions.

And there’s another layer of complexity: a massive portion of enterprise context is locked in complex media formats like video, audio, and images that agents can’t natively process. Making these formats truly legible to AI is essential infrastructure work that most organizations haven’t yet addressed. Until this unstructured content becomes structured and searchable, AI agents will operate with significant blind spots.

The only way AI agents will be successful at scale is if they have access to the right information, in the right structure, at the right time, in a secure and well-governed way. AI will expand the use and value of this information by orders of magnitude over time, but only if the infrastructure exists to make it accessible.

MCP and the Agentic Architecture Shift

A single contextual layer to rule all APIs and bespoke integrations in the Era of AI: Model Context Protocol (MCP). Source: Descope.

This fragmentation problem is exactly why the Model Context Protocol (MCP) ecosystem has become strategically important, even with APIs available.

MCP, originally developed by Anthropic and now donated to the Linux Foundation's new Agentic AI Foundation, gives MCP clients a universal, open standard for connecting AI applications to external databases, servers, and third-party systems. It's not just about connecting agents to a system; it's about composing context from all the systems that collectively hold your enterprise's cognitive reality.

The numbers tell the story of rapid adoption: over 10,000 active public MCP servers exist today, covering everything from developer tools to Fortune 500 deployments. MCP has been adopted by ChatGPT, Cursor, Gemini, Microsoft Copilot, Visual Studio Code, and other major AI products. Enterprise-grade infrastructure now exists with deployment support from AWS, Cloudflare, Google Cloud, and Microsoft Azure. This will likely become the underlying glue of AI.

What we’re seeing emerge is an evolution from atomized MCPs (reflecting individual APIs) to aggregated, orchestrated context layers that can assemble meaningful context on demand. The winners in the AI-enabled enterprise won’t just be the platforms that secure their own data well. They’ll be the ones that embrace composability and make their context genuinely AI-visible.

The agentic future requires infrastructure that makes context available, structured, and secure across the entire SaaS ecosystem. That’s the architecture shift everyone is building toward.

What This Means for the Next Wave

If you’re building AI Enterprise tools for media or implementing them in your organization:

  • Prioritize the ‘front door’. The highest-leverage automation opportunities are at ingest and metadata generation at scale. Get that right, and value compounds downstream. Note: Imaginario AI provides labeled and vector-based contextual understanding and can help your organization in this part of the supply chain.
  • Design for transparency. If your AI makes recommendations, show your reasoning. Black boxes don’t scale where trust must be earned through explainability.
  • Assume human review first. The workflows that work (today) are ones where AI drafts and humans approve. Plan for that from the start, optimize for automation later.
  • Solve the fragmentation problem. Invest in infrastructure that makes enterprise context composable, secure, and AI-visible across your SaaS ecosystem. MCP adoption is accelerating for good reason. APIs are rapidly adapting to support MCP, and Imaginario AI is no exception.
  • Build for constrained agency. Give your AI enough freedom to do real work, but keep it within boundaries humans can understand and control.

The transformation is messy, uneven, and slower than the hype cycle suggests. But it’s happening. AI is moving from the margins to the mainstream, from idiotic hype to efficient supply chains.

We’re not at full automation for most workflows, as context and personalization at scale still need to be completely solved. However, we’re at something more mundane and more valuable: AI as co-pilot, lifting cognitive weight and letting humans focus on judgment and creativity.

The infrastructure work, making enterprise data composable, structured, and AI-visible, is less glamorous than demos of agents completing complex tasks autonomously. But it’s the foundation everything else depends on. The organizations that invest in this plumbing now will be the ones positioned to capture value as agentic capabilities mature.

That might not make for breathless keynote presentations. But it's the future that's actually getting built.


About Imaginario.ai
Backed by Techstars, Comcast, and NVIDIA Inception, Imaginario AI helps media companies turn massive volumes of footage into searchable, discoverable, and editable content. Its Cetus™ AI engine combines speech, vision, and multimodal semantic understanding to deliver indexing, simplified smart search, automated highlight generation, and intelligent editing tools.

Imaginario AI and Ortana Media Group Partner to Supercharge Video Workflows with AI-Native Indexing and Automation https://imaginario.ai/wide-lens/press/imaginario-ai-and-ortana-media-group-partner-to-supercharge-video-workflows-with-ai-native-indexing-and-automation/ Thu, 14 Aug 2025 12:45:34 +0000 https://imaginario.ai/?p=1875

LONDON, 14th August 2025 — Ortana Media Group, creators of the Cubix Media Aware Workflow Engine (MAWE), today announced a strategic technology integration with Imaginario.ai, the company behind the Cetus™ AI engine, a next-generation platform for multimodal video understanding, smart search, and AI-powered editing. This collaboration combines Ortana's robust media orchestration and automation layer with Imaginario's AI-native infrastructure, empowering media organisations to turn raw video into enriched, discoverable, and production-ready content with unprecedented speed and scale.

A Leap Forward in Video Intelligence & Orchestration

Designed for streaming services, post houses, broadcasters, and corporate marketing teams, the integration delivers a future-ready solution that offers:

1. High-accuracy multimodal indexing at scale – Powered by Imaginario’s proprietary Cetus™ engine, the system performs advanced scene understanding, shot detection, facial analysis, object recognition, speech-to-text, ambient sound and SFX detection, logo recognition, OCR, image-to-video matching, and multilingual transcription.

2. Automated end-to-end workflows – Through Ortana’s Cubix MAWE, the integration orchestrates ingest, AI tagging, metadata enrichment, transcoding, and asset movement across hybrid or cloud environments — reducing manual effort and accelerating turnaround.

3. Deep semantic search and dynamic discovery – Enables users to quickly find and repurpose scenes or moments, regardless of archive size, file format, or content type.

4. API-native and modular by design – Offers seamless integration into existing MAM/PAM ecosystems or as a standalone solution — supporting flexible, cloud-agnostic deployments.

5. Explainable AI with scene-level captioning – Generates descriptive scene captions and match scoring to clarify how content is indexed — enhancing transparency, user trust, and tagging accuracy.

Unlocking the Value of Video at Scale

This joint solution enables post-production, marketing, and archive teams to spend less time rewatching content and more time creating. Whether for programmatic content creation, social repackaging, contextual advertising, or large-scale archive revitalisation, teams can now:

– Search video by who, what, where, when, and even why, using multimodal prompts.
– Automatically surface the best moments for highlights, trailers, ad inventory, and shorts, all within existing workflows.
– Push enriched assets directly into cloud storage and post-production tools like Adobe Premiere or DaVinci Resolve.

The integration is available immediately and supported across Cubix Yunify, Appliance, Halo, and Connect. It is already being evaluated by several broadcasters and content providers in Europe and North America.


About Imaginario.ai
Backed by Techstars, Comcast, and NVIDIA Inception, Imaginario AI helps media companies turn massive volumes of footage into searchable, discoverable, and editable content. Its Cetus™ AI engine combines speech, vision, and multimodal semantic understanding to deliver indexing, simplified smart search, automated highlight generation, and intelligent editing tools.

About Ortana Media Group
Ortana develops scalable and modular media orchestration solutions. Its flagship platform, Cubix, enables organisations to manage content workflows across cloud and on-premise environments, providing full visibility and control from ingest to distribution.

For more information, please contact:

Jose M. Puga
CEO
Imaginario AI
support@imaginario.ai

Can we generate B-roll with AI yet? https://imaginario.ai/wide-lens/artificial-intelligence/can-we-generate-b-roll-with-ai-yet/ Wed, 31 Jul 2024 11:04:53 +0000 https://imaginario.ai/?p=1608

When OpenAI first demoed their text-to-video model Sora we, along with a large swathe of the media industry, thought “wow, this is going to be a game-changer for B-roll.”

Generative AI video is still way too random and inconsistent to be used for A-roll. Characters, objects and settings will look different shot-to-shot, so getting reliable continuity is basically impossible. We learned a few months after the Sora unveil that even one of the featured videos – Air Head by Shy Kids – required substantial correction in post-production to remove inconsistencies and genAI weirdnesses. Shy Kids estimated they generated 300 minutes of Sora footage for every usable minute.

As with all AI efforts right now, we’re seeing huge progress towards more usable systems, and generative video AI startups like Odyssey have already appeared specifically promising the consistency and continuity necessary for good storytelling.

So for now, genAI video isn’t ready to tell stories all by itself. But maybe it can be a part of the storytelling process by producing B-roll. Any generative AI system is only as good as its training data, and there are millions of hours of establishing shots, landscapes, cityscapes and more out there. So lets put it to the test.

I’m going to include 6 of the most popular free generative AI systems on the market right now with a few different styles of prompt. I’m only going to use systems which allow full video generation from a text prompt, not systems which animate images.

Every generative AI system is unique and responds to different types of prompts in different ways, so this shouldn’t be seen as a test of which text-to-video AI is “the best” – you will definitely be able to get better results from each system by playing with the prompt and settings, and experimenting with each to get the best out of it.

The most important test for genAI systems right now – whether text, image or video – is if their output can appear as if it doesn’t come from a genAI system. That’s the benchmark we’ll be applying.

The systems we’ll be using:

Test 1 – A cityscape at night

The prompt

A panning shot of a present-day city at night. Streets, buildings and billboards fill the entire frame.

The results

The verdict

There are elements from a few videos that could be usable, specifically the middle-distance and skyline shots. The buildings created by Runway and Luma are very close to realistic, and the skylines in all shots that contain them are passable.

However, without fail, the traffic is a disaster – complex moving elements continue to be the Achilles heel of generative AI video, and it will be interesting to see if the upcoming models from larger providers (particularly Sora from OpenAI and Veo from Google) can make improvements here.

Test 2 – A forest at sunset

The prompt

The camera pans upwards from the treeline of a pine forest to reveal rolling hills beyond covered in trees, with mist resting in valleys between the hills. On the right side of the frame the sun is setting behind the trees in the distance, partially obscured by wisps of cloud, while a small flock of birds flies on the left side of the frame.

The results

The verdict

Now, these are much better results. There are a few genAI artifacts (Pixverse and Haiper’s birds, in particular), but overall these shots are usable. And perhaps more importantly for people generating footage for use in projects, these shots look like what I was picturing in my head when I wrote the prompt.

I purposely included multiple instructions in the prompt to see which model would follow them best. The individual elements were:

  • Camera movement
  • Type of trees
  • Misty valleys
  • Position of the birds
  • Position of the sun, with clouds and trees in front

I was pleasantly surprised to see that most of the models followed most of these instructions – a few missed the birds, but all of them nailed the forest, the misty valleys and the sunset. One notable curiosity is that only Kling followed the instruction to pan the shot correctly, every other model went for more of a drone or dolly shot with some movement. Kling’s generation interface specifically includes camera controls, so it makes sense it would understand this part of the prompt better.

Test 3 – a stormy seascape

The prompt

The camera flies quickly over a calm sea, we see the water moving with a few waves as we pass close above it. The camera pans upwards to reveal the horizon with a thunderstorm brewing in the distance.

The results

The verdict

In this test we can clearly see some unintended video hallucinations. In particular Haiper, which included the wake of a boat, and Pixverse, whose shot has been invaded by an unwanted seagull.

But again, much of the visual fidelity of these shots is close to good enough. Luma did a particularly good job of following the prompt. With the right color matching and editing, I think half of these shots could be used without being recognized as genAI. And for a technology that is hardly a year old, that is incredible.

What’s the future for AI-generated B-roll?

The simple answer is, as with everything in the generative AI space, it's going to get a lot better. The industry is realising a simple text prompt isn't enough to provide the kind of control filmmakers need, so we're already seeing these tools integrate camera movements, zoom controls and more to give creatives the ability to direct the shot in many of the same ways you would a live crew.

Visual quality will continue to improve, as will the speed of models, lessening the issue of having to generate reels and reels of renders to find something useful.

It’s also worth thinking about how generative AI will impact the use of B-roll more broadly. Of course it will always be important for artistic reasons, but covering a cut or a spoiled shot could become a thing of the past. Adobe recently announced they are adding features to extend shots and remove items via smart masking to Premiere soon, so maybe you’ll no longer need to plaster over that interview shot where someone walks behind your subject – you can just have Firefly cut them out and recreate the background?

AI search is improving B-roll usability too

We’ll never – or probably never – reach a point where there’s no demand for filmed B-roll, so your huge back catalogue of material will always retain its value. And AI isn’t all about generating new material, it can help you understand, index and search your library too. In fact that’s exactly what we’re building here. Check out the demo below to see how we’re unlocking archives for broadcasters, documentarians and more.



Article credits

Originally published on

With image generation from

Playground

And TikTok creation from

Imaginario AI
Multimodal AI: what is it, and how does it work? https://imaginario.ai/wide-lens/artificial-intelligence/multimodal-ai-what-is-it-and-how-does-it-work/ Mon, 22 Jul 2024 14:14:13 +0000 https://imaginario.ai/?p=1583

Multimodality, it’s so hot right now. 2024 was the year that all the major Large Language Models – ChatGPT, Gemini, Claude and others – introduced new modalities and new ways to interact.

Like most new technological fields AI is full to the brim with technical jargon, some of it totally unnecessary, but some of it quite consequential.

Multimodal is one of the consequential ones.

So what does multimodal mean?

Well it’s actually quite simple. In AI terms, a “modality” is a type of media through which an AI model can consume, understand, and respond to information – think text, audio, image, video.

Historically most AI systems have used only text as their training data, their input and their output, and so were single modality. In the last decade or so AI image recognition systems have become more and more common with products like Google Lens and Amazon’s Rekognition. These computer vision models were obviously a step up in complexity from text-based models, but were still limited to only images, and so were also single modality.

The next evolution was text-to-image models like Stable Diffusion and DALL·E which, technically speaking, are multimodal. They take a text prompt, and produce an image – two modalities! However in practice “multimodal AI” has come to mean systems which combine two or more inputs or outputs simultaneously or alongside one another. This is sometimes called multimodal perception due to the fact that, once you introduce multiple modalities, an AI can begin to perceive (or give the impression of perception) what it is looking at, rather than just matching text or visual patterns.

Imagine providing an image recognition AI with this image. A basic system will recognise individual elements: man, woman, nose, hair. A more mature system will put them together to understand the image as a whole: a crowd watching something.

However, if you showed a multimodal system a video of a crowd watching something, it will recognise movements, facial expressions, sound effects, music and more to build a complete description of the scene.

You can think of AI modalities as roughly equivalent to human senses. There’s a lot you can do with just one sense, but when combined they provide a more complete understanding of the world around you.
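
To make the idea concrete, here is a minimal sketch of a multimodal request: one prompt combining a text instruction with an image. It follows the shape of OpenAI's chat completions API at the time of writing (the model name and exact fields may change, so treat it as illustrative rather than canonical).

```python
# pip install openai  (requires an OPENAI_API_KEY in the environment)
from openai import OpenAI

client = OpenAI()

# One request, two modalities: a text instruction plus an image to reason about.
response = client.chat.completions.create(
    model="gpt-4o",  # a multimodal model; swap in whatever is current
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is the crowd in this photo reacting to?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/crowd.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```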

How many modalities are there?

The main modalities in regular use right now are:

  • Text, which can include things like:
    • Normal written chat
    • Numerical data
    • Code such as HTML and JavaScript
  • Image
  • Video
  • Text on screen (through optical character recognition)
  • Non-dialogue audio (music, sound effects etc.)
  • Dialogue

The vast majority of AI usage is still confined to single-modality text (and the vast majority of that usage is text inputs and outputs through ChatGPT’s web interface, app and API). However as visual and audio AI systems become more mainstream, and cheaper to operate, this will no doubt change in exactly the same way the early internet was mostly text and images, but now contains a huge amount of video, music, podcasts and more.

Another curiosity of current-generation AI is that, although language and visual models can often give the impression of approaching human intelligence, they lack so many of the building blocks of intelligence that we humans take for granted. These are also modalities, which could be incorporated into AI systems in the future.

An example: the first season of HBO’s House of The Dragon takes place over almost three decades, with the lead character of Rhaenyra Targaryen played by Milly Alcock during the first five episodes, and Emma D’Arcy for the remaining four. As humans we can recognise through production cues, wardrobe, context and many other things that we’re dealing with a time jump, and that this is the same character later in life.

An AI system, not understanding as basic a concept as the passage of time, will recognise two different faces and fail to understand they are the same character.

This is one of the most interesting challenges of building AI today – we’re trying to reconstruct human intelligence but starting in the wrong place, so we have to backfill many of the fundamental aspects of basic intelligence.

We are also seeing the emergence of so-called action models, which can complete tasks on behalf of their users – for example logging into Amazon and ordering something. It’s certainly possible that actions will become another modality that is incorporated into larger models in time.

What’s the future for multimodality?

The only guarantee in the AI space right now is that the pace of innovation will continue to be relentless. Even the word multimodal only entered the public sphere around 18 months ago, and the number of people Googling it has increased tenfold in the last year.

Mobile devices and wearables are an obvious category that can benefit from multimodal models. Although the first attempts at devices incorporating multimodal AI were a huge miss, we'll likely see these features incorporated into smartphones over time. The main limiting factor right now is the size of the models, which require a stable, fast internet connection to process queries in the cloud. This, too, will change as on-device models become more feasible.

Aside from being a nice-to-have, on-device multimodal AI has clear benefits for people with limited vision or hearing. A smartphone which can perceive the world around it is clearly useful for these groups. Imagine a visually-impaired person pointing their iPhone at a supermarket shelf and asking for help finding a specific product. Taking our human senses metaphor to its logical conclusion, these models can fill in the gaps for people who have lost those senses.

Away from consumer products, we are already seeing some really exciting developments in robotics where multimodal perception is allowing off-the-shelf robotic products to engage with the world without specific programming.

Until now industrial robots needed specific instructions for each task (close your claw 60%, raise your arm 45°, rotate 180°, etc.), but multimodal models may allow them to figure out how to complete a task themselves when given a specific outcome, by understanding the world around them and how to interact with it. Imagine a robot arm which can pick specific groceries of differing shapes and textures, and pack them taking into account how susceptible each item is to damage.

In time multimodal AI will become just another technology that we all take for granted, but right now in 2024, it’s the frontier where the most exciting artificial intelligence developments are taking place, and it’s worth keeping an eye on.



Article credits

Originally published on

With image generation from

OpenAI

And TikTok creation from

Imaginario AI
On-device vs cloud: which will unlock the full power of AI? https://imaginario.ai/wide-lens/artificial-intelligence/on-device-vs-cloud-which-will-unlock-the-full-power-of-ai/ Mon, 27 May 2024 12:16:53 +0000 https://imaginario.ai/?p=1509

Once again, AI takes the headlines. However, this time is different. We are at the beginning of a complete redesign of how humans interact with computers, possibly the biggest shift since Apple popularized the graphic user interface (GUI) or the introduction of Internet-connected devices. 

Microsoft recently announced Copilot for PCs while Apple is in talks with OpenAI to integrate GPT-4o. These strategies by tech giants are setting the stage for a revolution of on-device smart assistants and AI computers. This first generation of assistants (which will be truly intelligent, unlike Alexa and Siri) will generate and perceive audio, video, images and text, plus apply live multimodal understanding and basic reasoning capabilities.

Despite current model limitations in GPT-4o, Google's Gemini Nano and small language models (SLMs), the first seeds of seamless interaction with on-device AI assistants have been planted. Through the use of multimodal perception and voice, AI smart assistants aim to tackle everyday tasks much more efficiently than LLMs and traditional virtual assistants.

In other words, your laptops, tablets and mobiles are about to get a lot smarter, helping you work more efficiently and creatively by providing AI assistance and recommendations right where you need it and even in places with poor or no internet access. This is a huge step forward in making AI a core part of our daily life, making work and personal tasks easier, more productive and personalized. 

We’ve seen a few false-starts in this area, most notably the Humane AI pin and the Rabbit R1, both of which were universally panned by reviewers. The general consensus was that using an always-on Internet connection to interact with a large cloud model simply didn’t work.

New superpowers unlocked with on-device AI

Some of the main issues slowing down the adoption of AI apps are latency, the cost of training AI models, and privacy fears (including copyright theft). With AI integrated directly into our devices, all data and inference is processed locally, at the edge, resulting in faster responses, reduced training and inference costs, and enhanced privacy since users' data remains local.

During Microsoft’s recent announcement, one of the most impressive new features we saw (albeit controvesial) was the ability to remember anything you have done on your device. 

“We introduced memory into the PC. It’s called Recall. It’s not keyword search, it’s semantic search over all your history. It’s not about just any document. We can recreate moments from the past essentially”

said Microsoft’s CEO, Satya Nadella, in a recent interview. Basically, you can now scroll back in time to easily find apps, websites, documents, and more. 

It’s a “creepy cool” feature: great advantages unlocked as you get perfect photographic memory, but this also raises huge privacy concerns as Microsoft is taking screenshots every 5 seconds and using them to train their models. Elon Musk even said it should be turned off as it felt like a “Black Mirror” episode. 

When it comes to creative tools, expressing your ideas without any technical knowledge is becoming the norm. For example, with the new Paint, PC users can now generate endless images for free, and Microsoft is also partnering with Adobe, DaVinci Resolve, CapCut, and others for improved app performance directly on Surface devices. Microsoft will also offer live captions and translations in 40 languages.

AI assistants and live multimodal understanding

On-device AI and cloud AI both have their own benefits and challenges, and how they work is shaping the (near) future of user experiences, driven by assistants. Super quick responses mean faster and more personalized voice copilots that can perceive and understand your screens and multimodal data on the go, because everything is processed right on the device.

According to OpenAI’s CEO, Sam Altman, users might want an extension of themselves; an augmented alter-ego that acts on their behalf. For example, responding to emails without even informing them. Alternatively, there’s another approach of having an assistant that acts as an experienced senior employee who has access to the user’s email and works within the constraints set by the individual. This would be a separate entity. This assistant would always be available, detail-oriented, continuously consistent, and incredibly capable.

A key capability of these agents is live multimodal understanding. This is a technology that we use for, among other things, letting you search through your video content.

A demo that illustrates these capabilities (still in alpha for GPT-4o users) is screen sharing, in which Sal Khan (from Khan Academy) uses ChatGPT to guide and give hints to his son Imran on how to solve a mathematical problem. When you give ChatGPT access to your screen, the AI can immediately provide solutions and recommendations on the fly.

Another trend in this space, albeit realistically a few years away, are large action models. The difference with normal agents is that, in addition to retrieval and understanding, these models can orchestrate workflows, coordinate with other team members and AI assistants and be able to connect and act across multiple apps and systems (if interested check H, a company focused on Large Action Models that recently raised $220 million in seed funding).

On-device models are not enough

On-device AI also has big drawbacks such as limited computing power in mobile devices, data fragmentation and compatibility across teams and databases/storage systems, and increased battery and memory consumption. This might lead to different devices running different versions of AI models, making it harder to keep UX consistent across platforms. Not small challenges.

That’s why cloud AI is still so attractive for startups and Enterprises alike. It can use vast computing resources to train and make available more complex fine-tuned AI models, provide scalable infrastructure on the go (especially for video and 3D), which would be limited on a mobile device or a PC, plus it’s guaranteed your data will not be used for training purposes. 

Cloud AI also makes it easy to update and improve apps immediately ensuring your apps stay current and fresh with the latest version. Finally, cloud AI assistants can gather data from many sources, including on-device systems, cloud storage systems and the Internet as a whole, to make the AI smarter and more accurate.

What’s next?

The efforts to develop and enhance on-device and cloud AI models and assistants are starting to converge, especially after the latest moves announced by tech giants like Microsoft and Apple (with Samsung and Google showing similar strategies). It all indicates that combining both on-device and cloud AI technologies to power assistants and apps will be the way forward. This way, users will get the speed and privacy of on-device AI along with the power and scalability of cloud AI.
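
As a hand-wavy sketch of what that convergence could look like inside an app (the routing rules and model labels are invented for illustration), each request is sent to an on-device model when it is private or offline, and to the cloud when it needs heavy lifting:

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    contains_private_data: bool
    needs_long_context: bool
    online: bool

def route(req: Request) -> str:
    """Illustrative routing: privacy and offline use favour the on-device model,
    heavy context favours the cloud, and the cheap local path is the default."""
    if not req.online or req.contains_private_data:
        return "on_device_slm"   # data never leaves the device
    if req.needs_long_context:
        return "cloud_llm"       # bigger model, scalable infrastructure
    return "on_device_slm"

print(route(Request("Summarise this contract", True, False, True)))    # on_device_slm
print(route(Request("Research our competitors", False, True, True)))   # cloud_llm
```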



Article credits

Originally published on

With image generation from

OpenAI
How personal AI assistants will supercharge productivity https://imaginario.ai/wide-lens/artificial-intelligence/personal-ai-assistants-supercharge-productivity/ Tue, 20 Feb 2024 10:31:03 +0000 https://imaginario.ai/?p=977

There’s no denying artificial intelligence is evolving at an astonishing pace. Across so many disciplines – the written word, photography, audio, video, 3D rendering, automating workflows – AI can do things that were unthinkable just a few years ago. A combination of fantastic academic research, powerful infrastructure, entrepreneurial vigor, billions of cash invested and – importantly – a huge amount of training data, has supercharged the field.

However for all of the amazing things AI can do, the results can undeniably be generic. It makes sense – when you train a large language model on all of the written material you can find, it will tend to produce something that’s the average of all its inputs. If you combine all of your brightest paint colors together, you’re always going to end up with a shade of brown.

But despite this limitation, it's undeniable that the current generation of AI tools are the genesis of something truly exciting. The question for those of us working in the industry is: where do we go next? How can we focus our resources and research in the right direction to make sure that the tools we develop are truly useful to people and not just technical showcases?

As you can imagine this is something we spend quite a lot of time thinking about, and I wanted to lay out our thoughts on where we go next. How can we take this nascent technology and turn it into a productivity multiplier, and something that people want to use every day?

From general purpose to genuinely personal

In most fields it's accepted that you will go through a period of entry-level training to find your feet before specializing in a specific area. You have to know how to do the basics before you can perform more advanced tasks. Right now, AI is in its 101 phase. It's learning how to answer questions like a human, draw pictures and, in our case, understand videos.

But pretty soon it’s going to be time to start specializing, and developing AI systems that can perform very specific tasks with the nuance and care of a human. And because every human has their own nuances, we believe the future is highly specialized. Personal assistants will actively learn from you in real-time and update as new information is published online (or offline in private databases). They will understand how you work, when you work, who do you work with and can take care of the manual tasks in the background while you do the stuff that humans are best at – idea generation, human connection and new ways to solve problems.

Let me give you an example. If you edit a lot of video you probably have a workflow that you follow. Record your podcast, find b-roll footage, sync audio tracks, assemble a rough cut on Adobe Premiere, tidy up the dialogue to remove pauses and stutters, resize for mobile, add overlays or captions, then export. With an AI that already understands the fundamentals of video search and editing, your personal workflow becomes extra training data to create an intelligent assistant that saves you – specifically you – time.
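
As a thought experiment, that personal workflow could be captured as plain data the assistant learns and replays; the step names and the auto/human split below are made up for illustration, with the human kept on the steps where taste and judgement matter.

```python
# A personal editing workflow expressed as data an assistant could learn and replay.
MY_PODCAST_WORKFLOW = [
    ("sync_audio_tracks", "auto"),
    ("assemble_rough_cut", "auto"),
    ("remove_pauses_and_stutters", "auto"),
    ("choose_b_roll", "human"),          # taste: which cutaways fit the story
    ("resize_for_mobile", "auto"),
    ("add_captions_and_overlays", "auto"),
    ("final_review_and_export", "human"),
]

def run_workflow(workflow):
    for step, owner in workflow:
        if owner == "auto":
            print(f"assistant: handling '{step}' in the background")
        else:
            print(f"you:       review needed for '{step}'")

run_workflow(MY_PODCAST_WORKFLOW)
```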

At Imaginario AI, we strongly believe that we are moving from tool-based workflows to task-based ones with personal agents orchestrating these apps and recommending more efficient solutions. We call this Accelerant AI. The future will also be hybrid (cloud and on-prem storage and systems) with teams collaborating virtually (in 2D and 3D) from different parts of the world; they will be searching, creating and transforming content out of multiple systems and tools.

This AI could equally apply to other fields, and in writing (by a long way the area where AI is most advanced) we're already seeing the beginnings of this trend. Custom LLMs and third-party plugins like those offered by OpenAI and Google's Bard can be used to come up with article outlines, talking points and even do background research for you. It's very easy to imagine a writing app with a built-in AI that watches you work, and over time offers help with your most frequently-performed tasks. Your assistant does the busywork, and you're left with the fun part: the creativity!

Wondering what we're building? @Jozpug had a chat with @Joey Daoud about where we see the future of AI productivity. Check out the full interview on VP Land: https://youtu.be/4WOb5Y1Qcp0

Training data needs to be ringfenced

We’re big advocates of AI, but we’re also big advocates of privacy and copyright, especially in the Media and Creator space. That’s why we never use videos uploaded to our platform to train generative models or anything that could “leak” proprietary information. And if we’re going to embrace a future of hyper-personal AI assistants, we need to take the same approach.

How I edit videos is different to how you edit videos, and the style we’re looking for in the finished product is probably very different. So quite apart from privacy concerns, an AI video editing assistant that knows my style probably won’t be of much use to you. And, to take things full-circle, if all of our editing styles are used as training data for one omnipotent AI editing assistant, all our videos will end up looking the same. We’ll all be using brown paint.

However – big detour here – that does raise the possibility of high-profile creators opening access to their personal assistants to others to learn how to imitate their style. Want to produce features like Christopher Nolan or shorts like Casey Neistat? Maybe you'll be able to license their assistant, which will highlight a certain cut and say "Casey would have done it like this".

With the new release of Sora, OpenAI's text-to-video model, we are at the dawn of a new era in personalized content creation. Once this model becomes customizable (with opt-out for data training), from a simple text prompt you will be able to re-create content you have produced in the past or at the very least modify it for a fraction of the cost (goodbye VFX and green screens). Who knows? Even license your style to other creators.

What’s this going to look like in real life?

As with all great technologies, in time these features will recede into the background and will become a normal part of your workflows. Remember when we moved from having to manually hit “save” every so often to auto-saves happening in the background? Same thing.

Imagine you want to make a new TikTok on baking the absolute best banana bread. You shoot your cooking shots, a selfie video explaining your method, and some B-roll of your dog helping you out (because who doesn't love some bonus dog content?). You import all of your footage into Imaginario AI, and tell your personal assistant you're making a TikTok about baking banana bread.

Your assistant will analyze your clips and understand the correct order (perhaps by querying a separate GPT to obtain a template banana bread recipe), and quickly build a rough cut. Because you're making a TikTok, it will understand that your selfie video is a cutout and overlay it on top of the action. Naturally, it will add captions. Granted, it might not understand where to put the dog.

Then your job becomes more akin to a director. All the busywork has been done for you, so you can spend your time finessing the final edit to fit your vision. Maybe you want to up the pace here, add some sound effects there, add a spontaneous cutaway somewhere else.

In short, what we’re talking about is a world where the annoying, fiddly, time-consuming parts of your workflow are largely automated in a way that respects your curation criteria and editing style, and with the context of the end product you’re looking for. Your first cut will be assembled in seconds. What’s left for you is humor, emotion, plot twists – the stuff humans are best at.



Article credits

Originally published on

With image generation from

OpenAI

And TikTok creation from

Imaginario AI