I am trying to filter out data that is stored as strings but represents numeric values.
I filter out non-numeric strings then convert to a float. I am then trying to filter values above or below a specific amount. Below is a simplified version of my query that works with the filters disabled
WITH attributes AS (
SELECT
property_id,
CASE WHEN regexp_replace(value, '[^0-9.]', '')='' THEN
null
WHEN regexp_replace(value, '[^0-9.]', '') = '.' THEN NULL
ELSE
regexp_replace(value, '[^0-9.]', '')::float
END AS sqft, 'platform' FROM mv_prop_attributes
WHERE
display_name = 'Livable Area'
and sqft is not null
and length(sqft)>0
)
SELECT
*
FROM
attributes
WHERE true
-- and sqft < 100000 --works
-- and sqft>200 --works
-- and sqft < 100000 and sqft>200 --does not work
-- and sqft between 200 and 100000 --does not work
As noted in the commented portions I can filter above or below a value but not a range. Attempts to do so produce the following error:
ERROR: Invalid digit, Value ‘.’, Pos 0, Type: Double
DETAIL:
error: Invalid digit, Value ‘.’, Pos 0, Type: Double
code: 1207
context: .
query: 438646641
location: :0
process: query0_117_438646641 [pid=0]
This is running on an up to date redshift cluster. Thank you for any suggestions on how to resolve this at the DB level.
2
Answers
I think the problem is likely that you didn’t account for a period (vs. a decimal point). The string "one thousand square feet." would come through your SQL as a single ‘.’ which cannot be converted to a float.
I suspect that you aren’t getting the error with inequality as Redshift can push down WHERE clauses where it can (but this case seems extreme). Inequality can be applied to string – ‘1’ < ‘2’ – but this doesn’t work the same for ranges. This could be letting the query run but is a long shot.
==================================================================
A way to test if the source data has extra ‘.’s in it would be to run:
Try regexp_replace(value, ‘[^0-9.]’, ”,1,0,’g’) – replace all non numeric symbols and ‘.’, not only first.