Customised client-side AI with Apple's CLIP models

Perform Core ML magic, entirely offline

Jacob Bartlett
Jun 10, 2025

Client-side AI has come a long way since 2017, when Apple released Core ML 1.0. Long before Foundation Models arrived, Core ML had been quietly maturing into a powerful tool for those with the know-how to use it.

Last week, I unleashed Summon Self to the world, an app that lets you create trading cards out of your friends. Each card gets a fitting name and stats, powered by customised client-side CLIP models: open-sourced by Apple, these models perform zero-shot image classification on your photos.

This tech simply didn’t exist until recently.

Today, we’re going to learn full-stack client-side AI:

  • Finding the right model for your use case

  • Downloading the open-source model

  • Customising the model using Python scripts

  • Embedding models in your app

  • Using the Vision framework to process results

These surprisingly simple techniques focus on CLIP, but apply to any Core ML model. From today, you’ll unlock powerful AI features in your own apps while paying $0 for servers.


Apple’s AI Models

Apple has a plethora of oven-ready Core ML models available for you to use in your apps. These include:

  • FastViT: an image classification model

  • YOLOv3: an object detection model

  • Depth Anything V2: a depth estimation model*

*at some point I want to turn this into a spectral ghost camera

But this only scratches the surface. Apple’s most interesting set of models live on HuggingFace, the community for ML professionals and enthusiasts to host and share their own models and training data sets.

Included in this list are the MobileCLIP models, Apple’s own implementation of OpenAI’s CLIP neural network, optimised for Core ML and iOS.

CLIP Models

CLIP, or Contrastive Language–Image Pretraining, is a neural network developed by OpenAI that connects images with text. It trains on giant datasets of text-image pairs scraped from the internet.

CLIP training produces a pair of twin models: an image encoder and a text encoder. These are taught to map images and text into a shared “embedding” space, effectively creating vectors that each represent a point in 512-dimensional space.

This enables zero-shot image classification, meaning the trained model can classify images and text it has never seen before and still get pretty good results.
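
To make that concrete, here’s a minimal sketch in plain Swift (no Core ML yet) of what classification by embedding looks like. Every value here is hypothetical; the principle is just a nearest-neighbour search by cosine similarity.

```swift
import Foundation

// Cosine similarity between two embedding vectors of equal length.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    let dot = zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
    let magnitudeA = a.reduce(0) { $0 + $1 * $1 }.squareRoot()
    let magnitudeB = b.reduce(0) { $0 + $1 * $1 }.squareRoot()
    return dot / (magnitudeA * magnitudeB)
}

// Pick the label whose (precomputed) text embedding sits closest to the image embedding.
func bestLabel(for imageEmbedding: [Float],
               among labelEmbeddings: [String: [Float]]) -> String? {
    labelEmbeddings.max { lhs, rhs in
        cosineSimilarity(imageEmbedding, lhs.value) < cosineSimilarity(imageEmbedding, rhs.value)
    }?.key
}
```

The candidate labels are just whatever strings you chose to embed, which is what makes the classification zero-shot: adding a new category needs no retraining.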

Apple’s MobileCLIP models are trained on the DataCompDR dataset, which contains a relatively small number (12 million) of high-quality image-text pairs. This allows for fast, compact models that perform efficiently on mobile devices.

In fact, Apple created four MobileCLIP model pairs, which vary in size and accuracy: S0, S1, S2, and B(LT).

Frankly, however, I’ve only had great results with the largest model, MobileCLIP B(LT). This powered the card naming and stats in Summon Self, my latest indie project. I combined it with Vision processing and SwiftUI Metal shaders to mint trading cards you can collect and trade with your friends. Read my full write-up here.


MobileCLIP B(LT) works like magic, but it also means saddling yourself with a 173MB image-processing model. This single model took up 95% of Summon Self’s app size!

First, I’m going to demonstrate how to generate text embeddings on your computer by running a simple Python script against the Text model. Next, we’re going to use the Vision framework to process images and classify them on-device.
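
As a preview of that second step, the Vision side tends to look something like the sketch below. The `MobileCLIPImageEncoder` class name is an assumption on my part: it stands in for whatever class Xcode autogenerates for the .mlpackage you bundle, and an embedding model like this returns a feature vector rather than classification labels.

```swift
import UIKit
import Vision
import CoreML

// Runs the bundled CLIP image encoder over a photo and returns its embedding.
// `MobileCLIPImageEncoder` is a stand-in for your autogenerated model class.
func imageEmbedding(for image: UIImage) throws -> [Float]? {
    guard let cgImage = image.cgImage else { return nil }

    let coreMLModel = try MobileCLIPImageEncoder(configuration: MLModelConfiguration()).model
    let visionModel = try VNCoreMLModel(for: coreMLModel)

    let request = VNCoreMLRequest(model: visionModel)
    request.imageCropAndScaleOption = .centerCrop

    try VNImageRequestHandler(cgImage: cgImage).perform([request])

    // An embedding model surfaces its output as a feature value, not classification results.
    guard let observation = request.results?.first as? VNCoreMLFeatureValueObservation,
          let multiArray = observation.featureValue.multiArrayValue else { return nil }

    return (0..<multiArray.count).map { Float(truncating: multiArray[$0]) }
}
```

Feed the result into the cosine-similarity comparison from earlier and you have classification, entirely offline.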

If you also bundle the text models, you can introduce user-customisable categories that generate text embeddings on-the-fly. One of the most exciting applications of this is Image Search, where text embeddings are matched against the closest images in real-time as the user types what they want to see.
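
Here’s a rough sketch of that on-the-fly text side, assuming you bundle the converted text encoder and load it as an MLModel. The input/output feature names and the 77-token context length below are assumptions that depend on your exported .mlpackage, and `clipTokenize` stands in for a real CLIP BPE tokeniser you’d also need to bundle.

```swift
import CoreML

// Placeholder for a real CLIP BPE tokeniser, which you'd bundle alongside the model.
func clipTokenize(_ text: String) -> [Int32] {
    // Real implementation: lowercase, BPE-encode, add start/end tokens.
    fatalError("Implement with a bundled CLIP tokeniser")
}

// Generates a text embedding on-device with the bundled CLIP text encoder.
// Feature names ("text", "final_emb"), the context length of 77, and zero-padding
// are assumptions; check them against your exported model's description.
func textEmbedding(for query: String, using textEncoder: MLModel) throws -> [Float] {
    let tokens = Array(clipTokenize(query).prefix(77))

    let input = try MLMultiArray(shape: [1, 77], dataType: .int32)
    for index in 0..<77 {
        input[index] = NSNumber(value: index < tokens.count ? tokens[index] : 0)
    }

    let provider = try MLDictionaryFeatureProvider(
        dictionary: ["text": MLFeatureValue(multiArray: input)]
    )
    let output = try textEncoder.prediction(from: provider)

    guard let embedding = output.featureValue(for: "final_emb")?.multiArrayValue else { return [] }
    return (0..<embedding.count).map { Float(truncating: embedding[$0]) }
}
```

From there, image search is the same cosine-similarity ranking as before, just run against every cached image embedding each time the query changes.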

Once you know how to do this with MobileCLIP, you can apply the exact same technique to all Apple’s free Core ML models, unlocking powerful client-side AI features in your own apps.

Subscribe to Jacob’s Tech Tavern for free to get ludicrously in-depth articles on iOS, Swift, tech, & indie projects in your inbox every week.

Paid subscribers get full access to this article, plus my full Quick Hacks library. You’ll also get all my long-form content 3 weeks before anyone else.
