AI, side-project, kotlin, android


Building an AI Audio Guide - Part 1

From idea to code

This is the first part of a series of articles about prototyping an AI audio guide app.

Happy new year! A bit late, but here we are. This is the first part of a series of articles where I’ll be writing about my first experience building an AI-powered mobile app. This was a fun learning experience and I hope to share something useful here.

Now, first, a bit of a background story. You can safely skip this part if you want.

Background story

I’m the kind of person who’s naturally curious; I can nerd out about literally anything, and that’s probably why I’ve picked up so many hobbies over the years. It happens that I got into tech, so that’s what I’ve been nerding out about the longest, always keeping an eye on the ever-changing trends in the industry.

My curiosity naturally drew me to AI around 2017, although from a distance. I remember talking about it with a flatmate (and great friend) of mine every evening. We’d watch videos on the new developments and the new papers coming out. I even convinced him to join me in Udacity’s AI Foundations Nanodegree, which we completed in 2018. Basic stuff, obviously, but it was already clear to me that this field would boom pretty soon.

Now, you can imagine how amazed I was when GPT-2 came out, let alone GPT-3. In 2023 AI went mainstream like we had never seen before. The hype was on par with – maybe even crazier than – all that crypto buzz, but this time actually useful products were being built.

The ChatGPT Hype Train

I signed up as soon as it launched. For a while I obsessed over prompt-engineering hacks, trying to see what this thing could do and how I could squeeze this new tech into my workflow. Writing, translation, coding, learning new things, you name it.

The next AI milestone for me was Raycast Pro, which added a ChatGPT wrapper interface. Now I could definitely incorporate it into my day-to-day tasks. Having tried lots of Chrome extensions, Obsidian plugins, and whatnot, this one felt natural. Now I was really using AI a lot for mundane tasks.

In my experience, you really need to try and use this tech yourself to see what it’s capable of. A lot of smart people around me were writing it off because they had an incompatible mental model of it.

Sure, I hadn’t yet had that mind-blowing lightbulb moment, especially for anything I could build as a mobile engineer. Most apps I saw were chatbots or image generators, which didn’t really click with me.

Then Q4 comes around and the multimodal models start getting released. So, now ChatGPT can understand pictures? Ok. This is interesting! I took a look at the OpenAI API docs – that vision API seemed crazy simple. So one chilly December night, I figured, why not give it a shot?

Picking the right tools

So, the first decision is which language to use, right? OpenAI has official SDKs for Python and JavaScript. I could work with both, but they always feel a bit like speaking a second language – I get by, but there are too many little mistakes. One thing I’ve learned over the years is that learning multiple new things on a single project is going to slow you down. So, before picking one of those I researched Kotlin SDKs and found the openai-kotlin library: a Kotlin Multiplatform alternative that offers almost every feature of the official SDKs. I went with it.
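
Wiring it up is just a couple of Gradle dependencies. A minimal sketch (the versions here are illustrative, check the project’s README for current ones; the client also needs a Ktor engine at runtime):

// build.gradle.kts (versions illustrative)
dependencies {
    // The openai-kotlin client
    implementation("com.aallam.openai:openai-client:3.6.2")
    // A Ktor HTTP engine, required by the client at runtime
    implementation("io.ktor:ktor-client-okhttp:2.3.7")
}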

✅ Check!

Inspiration

Ok, now what to build?

Well, I had just spent a week in Andalusia, in the south of Spain, and had the fresh memory of walking around its cities, juggling multiple apps to figure out the monuments and whatnot.

Imagine this: You’re walking around Granada and want to know more about what you’re looking at. You’re visiting the Alhambra? Great, they’ve got an audio guide web app. Next day, you’re at a medieval church in the city. Great, another app. Then you stumble upon this amazing monument – no app, so you’re stuck Googling (or asking ChatGPT). And sometimes you just see something incredible but have zero clue what it is – cue frantically searching Google Maps to even find the name!

Okay, sure, this is what happens when you don’t plan ahead. Guilty as charged. But still, there’s a problem to solve here.

I knew what to build.

What if there was an app where you snap a photo and ✨! – instant audio guide?

Honestly, even if I were the only one who used it, that sounded awesome. Alright then, time to roll up my sleeves!

1st Prototype: Just use ChatGPT

Now, the beauty of this Generative AI thing is that it’s super easy to prototype. This is what I did:

  • Went to the OpenAI platform page. (This is important because the ChatGPT UI that most people are used to is a product built on top of their APIs, so you need to use the platform to simulate the answers you’d get from those APIs.)
  • Uploaded a picture of the Brandenburg Gate in Berlin.
  • Added a few instructions to the prompt (something along the lines of the sketch below).
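
To give a flavor, the instructions were roughly of this shape (illustrative only, not the exact prompt from that night):

// Illustrative only, not the exact prompt.
val instructions = """
    You are an audio tour guide. Look at the attached photo and tell me,
    in a friendly spoken style, what this landmark is and why it matters.
""".trimIndent()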

And there you have it! Not perfect, but a surprisingly solid answer. I tested a few other landmarks and it kept working well enough.

I was sold on this idea and these tests gave me confidence to start building the app. But not a mobile app, not just yet.

2nd Prototype: A Kotlin App

Instead of jumping headfirst into building the Android app, I wanted to first iterate on the core implementation using the simplest approach possible.

This should be really straightforward:

  • Call the API passing an image file and a prompt.
  • Explore the ergonomics, e.g.:
    • File uploading via the API
    • How does text streaming work? (see the sketch after this list)
    • How long does it take to answer?
    • How to integrate with the TTS (Text-To-Speech) API
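
On the streaming question: openai-kotlin exposes chat completions as a Kotlin Flow of partial chunks, so the answer can be printed as it arrives. A minimal sketch, assuming that Flow-based API (getInstance() is the same client helper used in the snippets below):

// Streaming sketch: chatCompletions() emits partial chunks as a Flow,
// so the answer can be printed token by token as it arrives.
suspend fun streamAnswer(request: ChatCompletionRequest) {
    val openAI = getInstance()
    openAI.chatCompletions(request).collect { chunk ->
        print(chunk.choices.firstOrNull()?.delta?.content.orEmpty())
    }
}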

You see, spinning up an emulator, or the basic infrastructure to test this basic flow on a native Android app, is not as straightforward as calling a main function, and that’s what I wanted.

And this is literally what the code looked like:

// A thin wrapper around the TTS endpoint: turns the guide text into audio.
// getInstance() builds the OpenAI client (API key setup elided) and
// writeToFile() dumps the returned audio bytes to a local file.
object TTS {
    suspend fun run(
        text: String,
    ): Result<Path> = runCatching {
        val openAI = getInstance()
        val request = speechRequest {
            model = ModelId("tts-1")
            input = text
            voice = Voice.Shimmer
            speed = 1.15
        }
        val audio = openAI.speech(request)
        writeToFile(audio)
    }
}


// The vision call: sends the photo plus the prompt to GPT-4 Vision and
// returns the guide text. buildPrompt() is a small helper (elided) that
// assembles the instructions, including the city from the input.
object AudioGuide {
    suspend fun run(
        input: TourGuideInput,
    ): Result<String> = runCatching {
        val openAI = getInstance()
        val request = chatCompletionRequest {
            model = ModelId("gpt-4-vision-preview")
            messages {
                user {
                    content {
                        text(buildPrompt(input))
                        image(input.imageUrl)
                    }
                }
            }
            maxTokens = 600
        }
        openAI.chatCompletion(request)
            .choices
            .first()
            .message
            .content
            .orEmpty()
    }
}

// TourGuideInput was just a small data holder in the prototype.
data class TourGuideInput(
    val city: String,
    val imageUrl: String,
)

suspend fun main() {
    val input = TourGuideInput(
        city = "Berlin",
        imageUrl = "https://static.dw.com/image/52796179_605.jpg",
    )
    AudioGuide.run(input)
        .onFailure { it.printStackTrace() }
        .onSuccess { response: String ->
            // log() is a simple print helper; the output renders as a markdown table.
            log(
                """
                | Parameter | Value |
                |---|---|
                | City | ${input.city} |
                | ImageUrl | ${input.imageUrl} |
                | Output | $response |
                """.trimIndent()
            )
            TTS.run(response)
        }
}

And that was it. I built this that very evening and was impressed with the results.

Improving the prototype

So that prototype worked, but as soon as you start using pictures that are not super clear, or ask about not-so-famous sights, the magic starts to break down.

Which makes sense. This thing is not magic, it has lots of flaws, but if you give it enough context things start to improve. In the example above you can see that I was already telling it the city it was in. This helped clear up a lot of confusion because it reduces the search space from anywhere in the world to just one city. But can we do better?

Well, the idea was to eventually build a mobile app, and if we’re in an app, then we can get the user’s geolocation coordinates. We can also do reverse geocoding (address lookup)! Telling it the city and street name would be nice… but upon testing this (by manually putting the address into the prompts) I saw that it still didn’t solve the problem.
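
As a side note, reverse geocoding on Android is nearly a one-liner with the platform Geocoder. A rough sketch (this uses the older synchronous overload for brevity; newer API levels prefer the listener-based one):

import android.content.Context
import android.location.Geocoder

// Latitude/longitude in, human-readable address line out.
fun lookupAddress(context: Context, lat: Double, lng: Double): String? =
    Geocoder(context)
        .getFromLocation(lat, lng, 1)
        ?.firstOrNull()
        ?.getAddressLine(0)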

Next idea: the Google Places SDK! I could use that to find the tourist attractions nearby. It’s a plan! Now we:

  • Send the AI a picture
  • Create a nice prompt (sketched below) telling it:
    • The address where it was taken
    • The top attractions nearby
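
A rough sketch of that prompt assembly follows; buildPrompt() here is my own illustration of the idea, not the exact prototype code:

// Illustrative: inject the reverse-geocoded address and the nearby
// attractions from Places into the instructions.
fun buildPrompt(address: String, topAttractions: List<String>): String = """
    You are an audio tour guide. The attached photo was taken at: $address.
    Notable attractions nearby: ${topAttractions.joinToString(", ")}.
    Identify what the photo most likely shows and narrate it as a short,
    engaging audio guide segment for a visitor standing right there.
""".trimIndent()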

Also, I decided I was going to focus on a single city for this prototype. Since I was dependent on the Google Places API, I couldn’t trust the quality of its answers for just any city (same for ChatGPT). So I decided to focus on Berlin, the city I live in.

Validating the Solution

To validate all these assumptions, I still didn’t want to invest the time in implementing it all, so I hardcoded the data.

I researched a bunch of examples of hidden landmarks, fed their data into the LLM, and recorded the answers. Looking at the output, I was again happy with it.
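
The hardcoded entries looked roughly like this (the data class and field names are illustrative, not the exact prototype code):

// Illustrative shape of the hardcoded test data.
data class GuideSample(
    val city: String,
    val address: String,
    val topLocations: List<String>,
    val imageUrl: String,
)

val sample = GuideSample(
    city = "Berlin",
    address = "Bornholmer Str. 61, 10439 Berlin",
    topLocations = listOf("Place of 9 November", "Bösebrücke", "Kirschblütenpfad"),
    imageUrl = "https://berlintraveltips.com/wp-content/uploads/2021/07/Platz-des-9-November-Berlin-free-activities.jpg",
)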

This is an example of the output I got:

City: Berlin
Address: Bornholmer Str. 61, 10439 Berlin
TopLocations: Place of 9 November, Bösebrücke, Kirschblütenpfad, Scheifengarten
ImageUrl: https://berlintraveltips.com/wp-content/uploads/2021/07/Platz-des-9-November-Berlin-free-activities.jpg
Output: Right before your eyes, you’re looking at a powerful photographic display located near the historic Bornholmer Strasse in Berlin, which is part of the larger Place of 9 November installation.
This particular image captures an extraordinary moment in history, an intimate glimpse into the jubilant faces of East Berliners at the Bornholmer Strasse border crossing on the night of November 9, 1989 – a night that altered the course of history.
This was the first crossing to open during the fall of the Berlin Wall, marking a pivotal point in the reunification of Germany.
People streamed through here, in disbelief and elation, as the barriers that once divided a city, and symbolically the world, were lifted.
In this surrounding area, the picturesque Kirschblütenpfad, or Cherry Blossom Path, awaits you during the spring, standing in contrast to the Bösebrücke, another witness to the changes that swept across this city.
This installation invites contemplation and serves as a portal to the past, allowing visitors to stand amidst history retold through imagery.
As you wander alongside this precious slice of the city’s timeline, you can almost hear the echoes of cheers and the rustle of footsteps that marked the beginning of a new era. It’s not just a photograph, it is a testimony to unity, to change, and to the enduring spirit of Berliners.

Now I felt confident that this would at least make a good prototype.

Wrap up

Starting small, we can use generative AI like ChatGPT to play around, refine ideas, and turn them into something interesting. The key is: prototype and iterate!

By now, the mobile app is finished. But this series of articles is intended to document the process and hopefully give you some insights along the way.

Jump to Part 2

...