
For a bit of context, I recently started working on a personal project that accepts the URL of some recipe web page, pulls the HTML, converts the HTML to simplified markdown (this is the GPT-3 part), then sends that markdown to a thermal receipt printer in my kitchen, which prints it out.

Recipe web pages have a wide variety of structures, and they are notorious for including long and often irrelevant articles before the recipe, for the sake of SEO.

My plan was to use the fine-tuning API for davinci2, feeding it raw recipe-page HTML as input and cleaned, recipe-only markdown as output. I noticed, though, that the maximum input token count for both training and inference is 4096. The HTML for a web page can be much larger than that, often around 20k tokens.
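
As a reference point, here is a minimal sketch for measuring how far over the limit a typical page actually is. It assumes the tiktoken library, a p50k-style encoding (which may not exactly match the model being fine-tuned), and a hypothetical recipe URL:

```python
import requests
import tiktoken

def count_tokens(text: str, encoding_name: str = "p50k_base") -> int:
    """Count how many tokens `text` would occupy under the given encoding."""
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))

# Hypothetical recipe page; raw HTML is often far beyond the 4096-token limit.
html = requests.get("https://example.com/some-recipe").text
print(count_tokens(html))
```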

I am wondering if anyone has found a workaround for training and driving GPT-3 with more tokens than 4096.

I’m open to other suggestions as well. For instance, I’ve considered passing just the visible text on the page, rather than the full HTML tree, but there is much less context present in that form, and the model seems more easily confused by all of the links and other navigational elements on the page. I have also considered only allowing this project to accept "printer-friendly" versions of recipes, which tend to be much smaller and would easily come in under the 4096-token limit, but not all sites offer a printer-friendly article, and I don’t want this to be a limitation.
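
For what it's worth, a minimal sketch of the "visible text only" idea, assuming BeautifulSoup and a guessed list of tags to discard; this usually shrinks the token count considerably, at the cost of structural context:

```python
from bs4 import BeautifulSoup

def visible_text(html: str) -> str:
    """Strip non-content tags and return collapsed visible text."""
    soup = BeautifulSoup(html, "html.parser")
    # The tag list here is an assumption about what carries no recipe content.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())
```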

2 Answers


  1. I do not know of any workarounds, but have you thought of filtering the HTML elements based on some basic rules? You could include only paragraph elements, or elements that have certain characteristics, like having a list within them, which is something most recipes have.
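
    A rough sketch of this suggestion (my own illustration, not the answerer's code), assuming BeautifulSoup: keep paragraph text plus ordered and unordered lists, which tend to hold the ingredients and steps.

    ```python
    from bs4 import BeautifulSoup

    def filter_recipe_html(html: str) -> str:
        """Keep only paragraphs and lists, dropping navigation and other markup."""
        soup = BeautifulSoup(html, "html.parser")
        chunks = []
        for el in soup.find_all(["p", "ul", "ol"]):
            text = el.get_text(" ", strip=True)
            if text:
                chunks.append(text)
        return "\n\n".join(chunks)
    ```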

  2. This framework might be useful to you: https://github.com/Xpitfire/symbolicai

    The basic idea is:

    1. You could stream over your input data and build up a stack of chunks on the side.
    2. Next, your training procedure needs to account for having loosely connected chunks of data. You could handle this by indexing or clustering the chunks before designing your prompts.
    3. This means that when you want to create a query for a question related to your long data stream, you can search through your indexes and retrieve the related information.
    4. Now you piece together your few-shot learning prompt, with one "section" that relates to your query and another for the facts you want to include.
    5. Finally, you can feed that into your model, along with examples of what you want the model to be tuned to.

    I know this is explained at a fairly high level, but if you follow the link I provided, things might become clearer.
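
    A very small sketch of the chunk-index-retrieve idea (my own illustration, not code from the symbolicai framework), with TF-IDF standing in for whatever embedding or clustering index is actually used:

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def chunk(text: str, size: int = 2000) -> list[str]:
        """Split a long document into fixed-size character chunks (a crude stand-in)."""
        return [text[i:i + size] for i in range(0, len(text), size)]

    def top_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
        """Return the k chunks most similar to the query."""
        vectorizer = TfidfVectorizer().fit(chunks + [query])
        scores = cosine_similarity(vectorizer.transform([query]),
                                   vectorizer.transform(chunks))[0]
        ranked = sorted(zip(scores, chunks), key=lambda p: p[0], reverse=True)
        return [c for _, c in ranked[:k]]
    ```

    The retrieved chunks would then form the "facts" section of the few-shot prompt, keeping the total under the 4096-token limit.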
