Overview
Currently, I use the GoogleGenerativeAI library to handle generative AI prompt requests in my application. Gemini promises to be a multi-modal AI model, and I’d like to enable my users to send files (e.g. PDFs, images, .xls files) along with their AI prompts.
I was using the following workflow to enable people to upload a file and use it in a prompt:
- Enable file selection from their local machine (e.g. PDFs, .doc, .xls formatted files).
- Upload the file to Google Cloud Storage, get an accessible link to the newly-uploaded file.
- Send the request to Gemini with the link to the file included in the prompt (where appropriate).
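For reference, here’s a minimal sketch of that workflow, assuming the Firebase web SDK and the @google/generative-ai package (the upload path, model name, and prompt wording are just placeholders):

```js
import { getStorage, ref, uploadBytes, getDownloadURL } from "firebase/storage";
import { GoogleGenerativeAI } from "@google/generative-ai";

// Upload the user's file to Cloud Storage, get a shareable link,
// then include that link in the text of the prompt.
async function uploadAndPrompt(file) {
  const storage = getStorage();
  const fileRef = ref(storage, `uploads/${file.name}`);
  await uploadBytes(fileRef, file);
  const url = await getDownloadURL(fileRef);

  const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
  const model = genAI.getGenerativeModel({ model: "gemini-pro" });

  // This is the step that fails: the model cannot fetch external URLs.
  const result = await model.generateContent(
    `Please summarize the document at ${url}`
  );
  return result.response.text();
}
```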
However, I’m now finding that this solution no longer works. Instead, I’m seeing responses like this:
I lack the ability to access external websites or specific files from
given URLs, including the one you provided from Google Cloud Storage.
Therefore, I’m unable to summarize the content of the file.
What I’ve Considered
- Using multiple libraries to handle document types client-side and convert them into text (e.g. pdf-parser for PDFs), and using Gemini’s image-handling model when there’s an image involved (a rough sketch follows below). However, this involves lots of libraries, and it seems that Gemini is promising to handle this for me / my users.
- Pre-processing the uploaded files server-side (for example, sending them to Google’s Document AI), turning each document into some type of consistently-structured data, then using that data with the GoogleGenerativeAI library. Document AI calls are expensive, though, and it seems that Gemini is meant to handle this kind of thing.
My App’s Stack (In Case it Matters)
- Firebase / Google Cloud Functions
- Vercel
- Next.js
Can you suggest an approach that will let users include files in the requests they make (via the web) to Gemini?
Thanks in advance!
2 Answers
The documentation on generating text from text-and-image input (multimodal) has an example of how to include image data in a request.
As Guillaume commented, this requires that you include your image data as a base64-encoded part of your request. While I haven’t tested the JavaScript bindings myself yet, this matches my experience with the Dart bindings, where I also included the images as base64-encoded parts.
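For example, with the @google/generative-ai JavaScript SDK the image goes into the request as an inlineData part alongside the text prompt. This is only a sketch based on the documented API (the file path, MIME type, and environment-variable name are placeholders), since, as noted, I haven’t run the JavaScript bindings myself:

```js
import { GoogleGenerativeAI } from "@google/generative-ai";
import { readFile } from "node:fs/promises";

async function describeImage(path, mimeType) {
  const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
  // The vision-capable model accepts mixed text and image parts.
  const model = genAI.getGenerativeModel({ model: "gemini-pro-vision" });

  const imagePart = {
    inlineData: {
      data: (await readFile(path)).toString("base64"), // base64-encoded bytes
      mimeType,                                         // e.g. "image/png"
    },
  };

  const result = await model.generateContent([
    "Describe the contents of this image.",
    imagePart,
  ]);
  return result.response.text();
}
```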
The multi-modal capabilities of Gemini are currently limited, and they differ slightly depending on whether you are using the Google AI Studio version of the library or the Google Cloud Vertex AI version.
Currently, neither library supports other modalities such as PDFs, .doc files, or spreadsheets. These may be supported in the future, but they aren’t available today.
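In practice that means you have to gate on the file type yourself before building the request. A rough sketch of that branching, where extractTextSomehow is a stand-in for whatever pre-processing (pdf-parse, Document AI, etc.) you choose:

```js
// Only image MIME types can be sent inline today; everything else has to be
// converted to text before it reaches Gemini.
async function buildParts(fileBuffer, mimeType, promptText) {
  if (mimeType.startsWith("image/")) {
    return [
      promptText,
      { inlineData: { data: fileBuffer.toString("base64"), mimeType } },
    ];
  }
  // Hypothetical pre-processing step for non-image files.
  const extractedText = await extractTextSomehow(fileBuffer, mimeType);
  return [promptText, extractedText];
}
```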