skip to Main Content

I need to implement a simple text classification using regex, and for this I though to apply a simple CASE WHEN statement, but rather than in case 1 condition is met, I want to iterate over all the CASEs

for example

with `table` as(
SELECT 'It is undeniable that AI will change the landscape of the future. There is a frequent increase in the demand for AI-related jobs, especially in data science and machine learning positions. It is believed that artificial intelligence will change the world, just like how electricity changed the world about 100 years ago. As Professor Andrew NG has famously stated multiple times “Artificial Intelligence is the new electricity.” We have advanced immensely in the field of artificial intelligence. With the increase in the processing and computational power, thanks to graphical processing units (GPUs), and also due to the abundance of data, we have reached a position of supremacy in Deep Learning and modern algorithms.' as text
)
SELECT 
  CASE 
    WHEN REGEXP_CONTAINS(text, r'(?i)ai') THEN 'AI'
    WHEN REGEXP_CONTAINS(text, r'(?i)computational power') THEN 'Engineering'
    WHEN REGEXP_CONTAINS(text, r'(?i)deep learning') THEN 'Deep Learning'
  END as topic,
  text
FROM `table`

with this query, the text is classified as AI because is the first condition that is met, but it should be classified as AI, Engineering and Deep Learning in an Array or in 3 different rows, because all 3 conditions are met.

How can I classify the text applying all the regex/conditions?

3

Answers


  1. One method is string concatenation:

    SELECT CONCAT(CASE WHEN REGEXP_CONTAINS(text, r'(?i)ai') THEN 'AI;' ELSE '' END,
                  CASE WHEN REGEXP_CONTAINS(text, r'(?i)computational power') THEN 'Engineering;' ELSE '' END,
                  CASE WHEN REGEXP_CONTAINS(text, r'(?i)deep learning') THEN 'Deep Learning;' ELSE '' END
                ) as topics, text
    FROM `table`;
    

    Actually, this constructs a string. You can use similar-ish logic to construct an array instead.

    Login or Signup to reply.
  2. Below is for BigQuery Standard SQL

    #standardSQL
    select 
       array_to_string(array(select distinct lower(topic)
        from unnest(regexp_extract_all(text, r'(?i)ai|computational power|deep learning')) topic
       ), ', ') topics,
      text
    from `table`   
    

    if to apply to sample data from your question – output is

    enter image description here

    Login or Signup to reply.
  3. I feel below is most generic and reusable solution (BigQuery Standard SQL)

    #standardSQL
    with `table` as(
    select 'It is undeniable that AI will change the landscape of the future. There is a frequent increase in the demand for AI-related jobs, especially in data science and machine learning positions. It is believed that artificial intelligence will change the world, just like how electricity changed the world about 100 years ago. As Professor Andrew NG has famously stated multiple times “Artificial Intelligence is the new electricity.” We have advanced immensely in the field of artificial intelligence. With the increase in the processing and computational power, thanks to graphical processing units (GPUs), and also due to the abundance of data, we have reached a position of supremacy in Deep Learning and modern algorithms.' as text
    ), classification as (
      select 'ai' term, 'AI' topic union all
      select 'computational power', 'Engineering' union all
      select 'deep learning', 'Deep Learning'  
    ), pattern as (
      select r'(?i)' || string_agg(term, '|') as regexp_pattern
      from classification
    )
    select 
       array_to_string(array(
        select distinct topic
        from unnest(regexp_extract_all(lower(text), regexp_pattern)) term
        join classification using(term)
       ), ', ') topics,
      text
    from `table`, pattern     
    

    with output

    enter image description here

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search