Models

Currently Geppetto supports three models:

  1. Moondream 2 (2024-05-08 version) - which provides seeing in the API
  2. Piper - which provides speaking in the API
  3. Whisper - which provides hearing in the API

Moondream 2 - Seeing

Moondream 2 is a vision language model by vik. It is very small and designed to run on the edge. The performance of Moondream 2 is quite good given its small size.

Release     VQAv2  GQA   TextVQA  TallyQA (simple)  TallyQA (full)
2024-05-08  79.0   62.7   53.1     81.6              76.1

Moondream 2 supports images of size 378x378. Any larger image is downscaled to fit this resolution before processing.
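Since larger images are downscaled server-side anyway, pre-scaling on the client can reduce upload size. Below is a minimal sketch using Pillow; the file paths are placeholders.

```python
# Minimal sketch: pre-scale an image to fit Moondream 2's 378x378 input
# size before uploading. Pillow's thumbnail() resizes in place and
# preserves aspect ratio, so the result fits within the given bounds.
from PIL import Image

def prepare_image(path: str, out_path: str, max_size: int = 378) -> None:
    with Image.open(path) as img:
        img.thumbnail((max_size, max_size))
        img.save(out_path)

prepare_image("photo.jpg", "photo_small.jpg")
```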

Piper - Speaking

Piper is a text-to-speech model optimized to run on the Raspberry Pi. As a result, inference is very fast, even on CPU.

Many voices are available for Piper, which you can find here. Currently Geppetto supports 6 voices. If you have requests for more voices, please let us know at [email protected]
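To illustrate how a voice might be selected, here is a hedged sketch of a speech request. The endpoint URL, payload fields, and voice identifier are assumptions for illustration, not the documented Geppetto API.

```python
# Hypothetical sketch of a text-to-speech call. The endpoint, fields,
# and voice name are placeholders, not the documented Geppetto API.
import requests

resp = requests.post(
    "https://api.example.com/v1/speak",  # placeholder endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"text": "Hello from Geppetto!", "voice": "en_US-amy-medium"},
)
resp.raise_for_status()
with open("speech.wav", "wb") as f:
    f.write(resp.content)  # assumes the response body is WAV audio
```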

Whisper - Hearing

Whisper is a state-of-the-art speech-to-text model by OpenAI. Currently Geppetto offers support for whisper-tiny. Larger models may be offered in the future.

Because whisper-tiny is so small, transcription is extremely fast with good accuracy, making it well suited to low-latency applications.
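As an illustration, a transcription request might look like the sketch below. The endpoint URL and field names are assumptions, not the documented Geppetto API.

```python
# Hypothetical sketch of a speech-to-text call against whisper-tiny.
# The endpoint and field names are placeholders.
import requests

with open("recording.wav", "rb") as f:
    resp = requests.post(
        "https://api.example.com/v1/hear",  # placeholder endpoint
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"audio": f},
    )
resp.raise_for_status()
print(resp.json().get("text", ""))  # assumed response shape
```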