I’m working on a Python script that extracts JSON content from a message string. The function is designed to parse a JSON block enclosed in a ```json fence. Below are the JSON input and the script I’m using:
JSON Input:
json
{
"description": "Crie um novo projeto chamado 'ProjetoTeste1'.",
"code": "local new_project = NewProject('ProjetoTeste1')
if new_project then
print('Projeto criado com sucesso: ' .. new_project.name)
else
print('Falha ao criar o projeto')
end"
}
Python Script:
import re
import json
from langchain_core.messages import AIMessage

class CodeSolution:
    def __init__(self, prefix: str, code: str):
        self.prefix = prefix
        self.code = code

def escape_code_field(code_str):
    # Escape backslashes first
    code_str = code_str.replace('\\', '\\\\')
    # Escape double quotes
    code_str = code_str.replace('"', '\\"')
    # Escape newlines
    code_str = code_str.replace('\n', '\\n')
    return code_str

def unescape_code_field(code_str):
    # Unescape newlines
    code_str = code_str.replace('\\n', '\n')
    # Unescape double quotes
    code_str = code_str.replace('\\"', '"')
    # Unescape backslashes
    code_str = code_str.replace('\\\\', '\\')
    return code_str

def extract_json(message) -> Any:
    text = message.content
    print("Message content:")
    print(text)
    pattern = r"```json\s*({.*?})\s*```"
    matches = re.findall(pattern, text, re.DOTALL)
    if not matches:
        raise ValueError("No JSON content found in the message.")
    json_content = matches[0].strip()
    # Escape the code field
    code_pattern = r'("code"\s*:\s*")([\s\S]*?)("\s*[,}])'
    def replace_code(match):
        code_value = match.group(2)
        escaped_code = escape_code_field(code_value)
        return f'{match.group(1)}{escaped_code}{match.group(3)}'
    json_content_escaped = re.sub(code_pattern, replace_code, json_content)
    try:
        parsed = json.loads(json_content_escaped)
    except json.JSONDecodeError as e:
        print(f"Error parsing content with JSON: {e}")
        print("Escaped JSON content:")
        print(json_content_escaped)
        raise ValueError(f"Failed to parse JSON content: {e}") from e
    prefix = parsed.get('description', 'No description available')
    code = parsed.get('code', '')
    if not code:
        raise ValueError("No 'code' field found in the parsed JSON.")
    code = unescape_code_field(code)
    return CodeSolution(prefix=prefix, code=code)
Error Traceback:
ValueError: No JSON content found in the message.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: No JSON content found in the message.
Additional Information:
The JSON that works is more complex and includes nested structures, whereas the failing JSON is simpler.
Both JSON blocks are correctly formatted and enclosed within ```json fences in the message content.
I adapted the script to handle the second JSON, but it still fails to recognize it.
Working JSON Example:
json
{
"description": "Obtaining the production of the well P10 in the model 'Modelo10' of the project 'ProjetoTeste10'",
"code": "local projects = GetProjects()
local project = projects['ProjetoTeste10']
if project then
local model = project.flux['Modelo10']
if model then
local well = model.well['P10']
if well then
local np = well.data['NP']
if np then
print('Produção acumulada do poço P10: ' .. np[#np])
else
print('Dados de produção não encontrados para o poço P10.')
end
else
print('Poço P10 não encontrado no modelo "Modelo10".')
end
else
print('Modelo "Modelo10" não encontrado no projeto "ProjetoTeste10".')
end
else
print('Projeto "ProjetoTeste10" não encontrado.')
end"
}
What I’ve Tried:
- Verified that the JSON is correctly formatted.
- Checked the regex pattern to ensure it accurately captures the JSON block.
- Added print statements to debug and confirm the message content.
Request for Help:
Why is the script unable to find JSON content in the simpler JSON example while successfully parsing the more complex one? Is there an issue with the regex pattern or the way the JSON is being processed? Any guidance on how to fix this would be greatly appreciated. Thank you!
2 Answers
I’m happy to help, but please provide more info, as I think you have something set up wrong (or are using the wrong tools). What exactly is this code doing?
What version of Python are you running? I was able to get it working with Python 3.12 on Linux Mint 22. I may be misunderstanding, but when I run your code with this main function:
I got the output:
The solution below works with your examples.
What I changed:
- Removed -> Any from the extract_json function.
- Changed pattern in extract_json.
- Added | re.MULTILINE to matches.
Code:
Output:
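As a rough sketch of the kind of changes listed above (this is not the answer's own code): the function below has no return annotation, anchors the opening and closing fences to line boundaries so that re.MULTILINE actually has an effect, and returns the raw block. The names FENCE, PATTERN and extract_json_block are made up for illustration, and the exact pattern is an assumption.

```python
import re

FENCE = "`" * 3  # three backticks, built at runtime so the listing stays readable

# Assumed pattern: with re.MULTILINE, ^ and $ match at every line boundary,
# so both fences must sit on their own lines. re.DOTALL lets .*? span newlines.
PATTERN = re.compile(
    r"^" + FENCE + r"json[ \t]*\n(\{.*?\})[ \t]*\n" + FENCE + r"[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_json_block(text):  # no "-> Any", so no typing import is required
    """Return the raw text of the first json-fenced block, or raise ValueError."""
    match = PATTERN.search(text)
    if match is None:
        raise ValueError("No JSON content found in the message.")
    return match.group(1)


# Small demonstration with a made-up message.
message_text = "\n".join([
    "Here is the result:",
    FENCE + "json",
    "{",
    '"description": "demo",',
    '"code": "print(1)',
    'print(2)"',
    "}",
    FENCE,
])
print(extract_json_block(message_text))
```

The returned text still contains raw line breaks inside the code value, so it needs the escaping step from the question (or some equivalent) before json.loads will accept it.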
Note that it would be simpler to write proper JSON in the first place by creating a Python dictionary of the description and code. Use a triple-quoted string to create the code string, and the JSON will be written with the newlines properly escaped. Then you could load the JSON directly with json.loads, without extra handling.
Example:
Output:
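A minimal sketch of that dictionary-based approach, using the Lua snippet and description from the question. The variable names lua_code, payload, encoded and decoded are only illustrative.

```python
import json

# Lua snippet from the question, kept readable in a triple-quoted string.
lua_code = """local new_project = NewProject('ProjetoTeste1')
if new_project then
print('Projeto criado com sucesso: ' .. new_project.name)
else
print('Falha ao criar o projeto')
end"""

payload = {
    "description": "Crie um novo projeto chamado 'ProjetoTeste1'.",
    "code": lua_code,
}

# json.dumps escapes the newlines (and any double quotes) inside the strings,
# so the result is valid JSON without a hand-written escape helper.
encoded = json.dumps(payload, ensure_ascii=False, indent=2)
print(encoded)

# Round trip: json.loads restores the original multi-line code string.
decoded = json.loads(encoded)
print(decoded["code"])
```

Because the newlines are stored as \n escapes inside the JSON string, neither escape_code_field nor unescape_code_field is needed on this path.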