For a bit of context, I recently started working on a personal project that accepts the URL of some recipe web page, pulls the HTML, converts the HTML to simplified markdown (this is the GPT-3 part), then sends that markdown to a thermal receipt printer in my kitchen, which prints it out.
Recipe web pages have a wide variety of structures, and they are notorious for including long and often irrelevant articles before the recipe, for the sake of SEO.
My plan was to use the fine-tuning API for davinci2, feeding it raw recipe HTML as input and cleaned, recipe-only markdown as output. I noticed, though, that the maximum token count for both training and inference is 4096, while the HTML for a recipe page can be much larger than that, on the order of 20k tokens.
I am wondering if anyone has found a workaround for training and driving GPT-3 with more tokens than 4096.
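One workaround I've seen discussed (not specific to this project) is splitting the input into overlapping chunks and processing each chunk separately. A rough sketch, using a crude whitespace-word approximation for tokens (a real version would count with the model's actual tokenizer, e.g. via the tiktoken library):

```python
def chunk_text(text: str, max_tokens: int = 3500, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks that each fit a token budget.

    Crude approximation: one whitespace-separated word ~ one token.
    The overlap gives each chunk some context from the previous one.
    """
    words = text.split()
    step = max_tokens - overlap
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), step)]
```

The catch for this use case is that a recipe chunked mid-ingredient-list still has to be reassembled coherently on the output side, so chunking alone may not solve the problem.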
I’m open to other suggestions as well. For instance, I’ve considered passing just the visible text on the page, rather than the full HTML tree, but there is much less context present in that form, and the model seems more easily confused by all of the links and other navigational elements on the page. I have also considered only allowing this project to accept "printer-friendly" versions of recipes, which tend to be much smaller and would easily come in under the 4096-token limit, but not all sites offer a printer-friendly version, and I don’t want this to be a limitation.
2 Answers
I don't know of any workarounds, but have you thought of filtering out HTML elements based on some basic rules? You could keep only paragraph elements, or elements with certain characteristics, like containing a list, which is something most recipes have.
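A minimal sketch of that kind of rule-based filtering, using Python's stdlib html.parser (the tag sets here are assumptions; you would tune them against real recipe pages):

```python
from html.parser import HTMLParser

# Assumed rule sets: keep text inside paragraphs, list items, and headings;
# drop scripts, styles, and navigational chrome.
KEEP = {"p", "li", "h1", "h2", "h3"}
SKIP = {"script", "style", "nav", "header", "footer", "a"}

class RecipeFilter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.keep_depth = 0   # nesting depth inside kept elements
        self.skip_depth = 0   # nesting depth inside skipped elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            self.keep_depth += 1
        elif tag in SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in KEEP and self.keep_depth:
            self.keep_depth -= 1
        elif tag in SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside a kept element and not inside a skipped one.
        if self.keep_depth and not self.skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)

def filter_html(html: str) -> str:
    parser = RecipeFilter()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

For example, `filter_html("<nav><a href='/'>Home</a></nav><p>Mix flour.</p><ul><li>2 cups flour</li></ul>")` drops the navigation link and keeps only the paragraph and list-item text. Output like this is much smaller than the raw HTML, which should help with the token limit.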
This framework might be useful to you: https://github.com/Xpitfire/symbolicai
The basic idea is:
I know this is explained at a high level, but if you follow the link I provided, things might become clearer.