Given an input like:
{'example_id': 0,
'query': ' revent 80 cfm',
'query_id': 0,
'product_id': 'B000MOO21W',
'product_locale': 'us',
'esci_label': 'I',
'small_version': 0,
'large_version': 1,
'split': 'train',
'product_title': 'Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceiling Mounted Fan',
'product_description': None,
'product_bullet_point': 'WhisperCeiling fans feature a totally enclosed condenser motor and a double-tapered, dolphin-shaped bladed blower wheel to quietly move air\nDesigned to give you continuous, trouble-free operation for many years thanks in part to its high-quality components and permanently lubricated motors which wear at a slower pace\nDetachable adaptors, firmly secured duct ends, adjustable mounting brackets (up to 26-in), fan/motor units that detach easily from the housing and uncomplicated wiring all lend themselves to user-friendly installation\nThis Panasonic fan has a built-in damper to prevent backdraft, which helps to prevent outside air from coming through the fan\n0.35 amp',
'product_brand': 'Panasonic',
'product_color': 'White'}
The goal is to output something that looks like:
Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceiling Mounted Fan [TITLE] Panasonic [BRAND] White [COLOR] WhisperCeiling fans feature a totally enclosed condenser motor and a double-tapered, dolphin-shaped bladed blower wheel to quietly move air [SEP] Designed to give you continuous, trouble-free operation for many years thanks in part to its high-quality components and permanently lubricated motors which wear at a slower pace [SEP] Detachable adaptors, firmly secured duct ends, adjustable mounting brackets (up to 26-in), fan/motor units that detach easily from the housing and uncomplicated wiring all lend themselves to user-friendly installation [SEP] This Panasonic fan has a built-in damper to prevent backdraft, which helps to prevent outside air from coming through the fan [SEP] 0.35 amp [BULLETPOINT]
There are a few operations going on to generate the desired output, following these rules:
- If a value in the dictionary is None, don’t add its content to the output string
- If a value contains newlines (\n), substitute them with [SEP] tokens
- Concatenate the strings in the order that the user specified, e.g. the output above follows the order
["product_title", "product_brand", "product_color", "product_bullet_point", "product_description"]
I’ve tried the following, which kinda works, but the function I’ve written looks a little too hardcoded in how it loops through the wanted keys and concatenates and manipulates the strings.
item1 = {'example_id': 0,
'query': ' revent 80 cfm',
'query_id': 0,
'product_id': 'B000MOO21W',
'product_locale': 'us',
'esci_label': 'I',
'small_version': 0,
'large_version': 1,
'split': 'train',
'product_title': 'Panasonic FV-20VQ3 WhisperCeiling 190 CFM Ceiling Mounted Fan',
'product_description': None,
'product_bullet_point': 'WhisperCeiling fans feature a totally enclosed condenser motor and a double-tapered, dolphin-shaped bladed blower wheel to quietly move air\nDesigned to give you continuous, trouble-free operation for many years thanks in part to its high-quality components and permanently lubricated motors which wear at a slower pace\nDetachable adaptors, firmly secured duct ends, adjustable mounting brackets (up to 26-in), fan/motor units that detach easily from the housing and uncomplicated wiring all lend themselves to user-friendly installation\nThis Panasonic fan has a built-in damper to prevent backdraft, which helps to prevent outside air from coming through the fan\n0.35 amp',
'product_brand': 'Panasonic',
'product_color': 'White'}
item2 = {'example_id': 198,
'query': '# 2 pencils not sharpened',
'query_id': 6,
'product_id': 'B08KXRY4DG',
'product_locale': 'us',
'esci_label': 'S',
'small_version': 1,
'large_version': 1,
'split': 'train',
'product_title': 'AHXML#2 HB Wood Cased Graphite Pencils, Pre-Sharpened with Free Erasers, Smooth write for Exams, School, Office, Drawing and Sketching, Pack of 48',
'product_description': "<b>AHXML#2 HB Wood Cased Graphite Pencils, Pack of 48</b><br><br>Perfect for Beginners experienced graphic designers and professionals, kids Ideal for art supplies, drawing supplies, sketchbook, sketch pad, shading pencil, artist pencil, school supplies. <br><br><b>Package Includes</b><br>- 48 x Sketching Pencil<br> - 1 x Paper Boxed packaging<br><br>Our high quality, hexagonal shape is super lightweight and textured, producing smooth marks that erase well, and do not break off when you're drawing.<br><br><b>If you have any question or suggestion during using, please feel free to contact us.</b>",
'product_bullet_point': '#2 HB yellow, wood-cased pencils:Box of 48 count. Made from high quality real poplar wood and 100% genuine graphite pencil core. These No 2 pencils come with 100% Non-Toxic latex free pink top erasers.\nPRE-SHARPENED & EASY SHARPENING: All the 48 count pencils are pre-sharpened, ready to use when get it, saving your time of preparing.\nThese writing instruments are hexagonal in shape to ensure a comfortable grip when writing, scribbling, or doodling.\nThey are widely used in daily writhing, sketching, examination, marking, and more, especially for kids and teen writing in classroom and home.#2 HB wood-cased yellow pencils in bulk are ideal choice for school, office and home to maintain daily pencil consumption.\nCustomer service:If you are not satisfied with our product or have any questions, please feel free to contact us.',
'product_brand': 'AHXML',
'product_color': None}
def product2str(row, keys):
    # Map each product field to the marker token appended after its content.
    key2token = {'product_title': '[TITLE]',
                 'product_brand': '[BRAND]',
                 'product_color': '[COLOR]',
                 'product_bullet_point': '[BULLETPOINT]',
                 'product_description': '[DESCRIPTION]'}
    output = ""
    for k in keys:
        content = row[k]
        if content:  # skip None (and empty) values
            # Replace newlines, not the letter 'n', with the separator token.
            output += content.replace('\n', ' [SEP] ') + f" {key2token[k]} "
    return output.strip()
product2str(item2, keys=['product_title', 'product_brand', 'product_color',
                         'product_bullet_point', 'product_description'])
Q: Is there some sort of native CPython JSON-to-str flatten function or recipe that can achieve similar results to the product2str function?
Q: Or is there already some function/pipeline in the tokenizers library https://pypi.org/project/tokenizers/ that can flatten a JSON/dict into tokens?
2 Answers
So I made this function that would do what you asked for.
So basically I do what you have done, BUT, instead of making a string, I make a list, and I join it later on.
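The list-then-join idea described above can be sketched roughly as follows. This is only an illustration of the approach, not the answerer's exact code: the module-level name KEY2TOKEN and the small sample row are assumptions made for the example.

```python
# Hypothetical module-level mapping (same tokens as in the question).
KEY2TOKEN = {
    "product_title": "[TITLE]",
    "product_brand": "[BRAND]",
    "product_color": "[COLOR]",
    "product_bullet_point": "[BULLETPOINT]",
    "product_description": "[DESCRIPTION]",
}


def product2str(row, keys):
    parts = []  # collect the pieces in a list instead of growing a string
    for k in keys:
        content = row.get(k)
        if content:  # skips None and empty strings
            parts.append(content.replace("\n", " [SEP] "))
            parts.append(KEY2TOKEN[k])
    # A single join at the end also avoids stray leading/trailing spaces.
    return " ".join(parts)


# Made-up sample row, much shorter than the real data.
sample = {
    "product_title": "Fan",
    "product_brand": "Acme",
    "product_color": None,
    "product_bullet_point": "quiet\nsmall",
    "product_description": None,
}
result = product2str(sample, list(KEY2TOKEN))
print(result)
```

Joining once at the end means there is no need for the final strip() call, and fields that are None simply never make it into the list.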
OUTPUT:
To me it seems crystal clear that keys should be a global variable. I guess you would call the function with the same keys argument repeatedly, so it would be better to make it a global and not pass it as an argument unnecessarily.
Your tokens follow a clear pattern: you remove the 'product_' prefix, remove underscores, and then convert to UPPERCASE. Why not make a function to do this?
And while you can use a dict comprehension to pre-generate the tokens, I advise against it because there wouldn’t be any significant performance gain and you would do an implicit loop every time you query that dict.
I have shortened your code to this:
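A minimal sketch of what such a shortened version could look like, assuming the tokens are derived from the key names as described above. The names KEYS and key_to_token are illustrative choices for this example, not necessarily the ones used in the original answer.

```python
# Assumed global key order, taken from the question.
KEYS = ["product_title", "product_brand", "product_color",
        "product_bullet_point", "product_description"]

PREFIX = "product_"


def key_to_token(key):
    # 'product_bullet_point' -> 'bullet_point' -> 'bulletpoint' -> '[BULLETPOINT]'
    name = key[len(PREFIX):] if key.startswith(PREFIX) else key
    return "[" + name.replace("_", "").upper() + "]"


def product2str(row):
    parts = []
    for key in KEYS:
        value = row.get(key)
        if value:  # skip None and empty strings
            parts += [value.replace("\n", " [SEP] "), key_to_token(key)]
    return " ".join(parts)


# Made-up sample row for illustration.
sample = {"product_title": "Pencils", "product_brand": "AHXML",
          "product_color": None, "product_bullet_point": "a\nb",
          "product_description": "desc"}
result = product2str(sample)
print(result)
```

Deriving the token from the key keeps the function free of a hand-written mapping, at the cost of tying the token names to the field names.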
I am afraid there isn’t anything more that can be done, to the best of my knowledge.
Using your examples I get the following outputs: