Meet the Builders · APAC
Giving Goa's heritage a voice, one photo at a time
How I used Google Gemini, Gemma and text-to-speech to turn any photo of a Goan church, fort or family heirloom into a short story you can listen to.
By Carson Rodrigues · Margao, Goa
The problem
Goa has some of the oldest churches in India and the largest surviving collection of Portuguese-colonial architecture in Asia. Walk through Old Goa and you are standing inside 16th-century history. But here is the quiet problem: most people walking past a 400-year-old facade — locals included — have no idea what it is, who built it, or why it matters. The buildings survive; the stories don't.
The same is true in people's homes. Almost every Goan family has a tin of old photographs — a wedding outside a village chapel, a grandfather in a borrowed suit, a feast-day procession — and within a generation, nobody is left who can say where or when they were taken.
What I built
Virasat (Hindi/Urdu for heritage) is a single web page. You drop in any photo of Goan or Indian heritage — a church, a fort, a monument, an old portrait — and it does three things:
- Identifies it. It names the subject when it can confidently recognise it, and describes the architectural style and likely era when it cannot — rather than making things up.
- Tells its story. It writes a short, spoken-style history: who built it, when, and what it meant.
- Reads it aloud. It narrates that story in a warm voice, so a heritage photo becomes something you listen to, not just read.
No account, nothing stored on a server. It is meant to feel like standing next to someone who knows the history and is happy to tell you.
How it works — built on Google AI
The whole thing runs on Google's models, through the Google AI Studio Gemini API — and deliberately uses three different ones, each for what it is best at:
- Gemini 3.5 Flash looks at the photo and writes the narration: the main story, plus a short title. Its prose is clean and warm, and it is careful to describe rather than fabricate when it isn't sure.
- Gemma 4 (Google's open model) runs a second, independent identification pass, a one-line “what and where”, so the result isn't resting on a single model's opinion.
- Gemini 2.5 Flash text-to-speech reads the story aloud. It returns raw 16-bit PCM audio, which the server wraps in a WAV header so it plays straight in the browser.
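The WAV-wrapping step in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code Virasat actually runs: the function name `pcmToWav` is hypothetical, and the defaults of 24 kHz mono are assumptions that should be matched to whatever the text-to-speech response really reports.

```typescript
// Wrap raw little-endian PCM samples in a 44-byte RIFF/WAVE header.
// Defaults (24 kHz, mono, 16-bit) are assumptions; match them to the real TTS output.
function pcmToWav(
  pcm: Uint8Array,
  sampleRate = 24000,
  channels = 1,
  bitsPerSample = 16,
): Uint8Array {
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign;          // bytes per second of audio
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) out[offset + i] = text.charCodeAt(i);
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk length for plain PCM
  view.setUint16(20, 1, true);              // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, pcm.length, true);     // number of PCM bytes that follow
  out.set(pcm, 44);                         // samples are copied unchanged
  return out;
}
```

The samples themselves are untouched. The header only tells the browser how to interpret them, which is why the result can be handed straight to an audio element.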
All three calls run in parallel from one server route, so the story, the identification and the audio arrive together. The front end is Next.js, deployed on Vercel — lightweight on purpose, because the interesting part is the models, not the plumbing.
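One way a single server route could fan out these calls is sketched below. Everything named here (`ModelCalls`, `narrate`, `identify`, `speak`, `buildStory`) is hypothetical and stands in for the real model requests, which are not shown in this article. The sketch also assumes that speech synthesis needs the finished story text, so it starts that call as soon as narration resolves while the identification pass is still running; the ordering in the real route may differ.

```typescript
// Hypothetical shapes: each function stands in for one model request.
type Narration = { title: string; story: string };

interface ModelCalls {
  narrate(photo: Uint8Array): Promise<Narration>; // vision + story writing
  identify(photo: Uint8Array): Promise<string>;   // one-line "what and where"
  speak(text: string): Promise<Uint8Array>;       // text-to-speech audio bytes
}

interface StoryResult extends Narration {
  identification: string;
  audio: Uint8Array;
}

async function buildStory(photo: Uint8Array, calls: ModelCalls): Promise<StoryResult> {
  // Start both image calls immediately so they overlap in time.
  const narration = calls.narrate(photo);
  const identification = calls.identify(photo);
  // Assumption: speech depends on the story text, so it chains on narration only.
  const audio = narration.then((n) => calls.speak(n.story));

  // Wait for everything; a failure in any call rejects the whole request.
  const [n, id, wav] = await Promise.all([narration, identification, audio]);
  return { ...n, identification: id, audio: wav };
}
```

Passing the model calls in as an object keeps the orchestration testable with stubs, without any network access or API key.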
An honest note on scope
My first idea was AI photo restoration: repairing and colourising faded photos. Google's image-generation models can do this beautifully, but they sit behind a paywall rather than on the free tier, and I wanted this to be genuinely free for anyone in Goa to use. So I built what the free models do brilliantly: understanding an image and speaking. It is a smaller idea, but a real and useful one, and it costs nothing to run.
Why it matters here
This is the kind of thing generative AI is genuinely good for: not replacing a guide or a historian, but putting a little of their knowledge in everyone's pocket. A tourist can understand the chapel in front of them. A student can hear the history of a monument in their own town. A family can attach a story to a photo before the last person who remembers it is gone.
What's next
- Narration in Konkani and Marathi, so a story can be told in the language of the people in the photo.
- A growing gallery of narrated Goan landmarks.
- Photo restoration, once I wire up a billing-enabled key for the image models.
Tools & technologies used: Google Gemini 3.5 Flash, Google Gemma 4, Google Gemini 2.5 Flash text-to-speech, Google AI Studio (Gemini API), Next.js, Vercel.