
I’m quite new to Apache Beam, and I’m trying to create a new column at index 0 that holds a unique ID constructed by grouping (GroupByKey) on multiple columns. How can I achieve this?

I also want to write the new data to a newline-delimited JSON file (where each line is one unique_id with an array of the objects that belong to that unique_id).

I’ve currently written:

import apache_beam as beam

pipe = beam.Pipeline()

rows = (pipe
        | beam.io.ReadFromText('data.csv')
        | beam.Map(lambda x: x.split(","))
        | beam.Map(print))

pipe.run()

Which basically converts each row into a list of strings.

This post has the sample input data, and its solutions use pandas; how do I achieve the same thing in a Beam pipeline?

Thank you!

2 Answers


  1. Have you tried CombinePerKey like this?

    import apache_beam as beam
    
    p = beam.Pipeline()
    
    test = (
        p
        | beam.Create([(0, "ttt"), (0, "ttt1"), (0, "ttt2"), (1, "xxx"), (1, "xxx2"), (2, "yyy")])
        # the callable receives an iterable of all values for a key
        | beam.CombinePerKey(lambda values: ",".join(values))
        | beam.Map(print)
    )
    
    p.run()
    
  2. Is it important to you to have the unique IDs be integers from 0 to n_groups like in your linked example?

    If not, then I don’t think there’s any need to use a grouping operation here. Consider the following:

    import apache_beam as beam
    
    def make_unique_id(row):
      """
      Example function for extracting a unique ID from the row.
    
      You could pass the composite key through uuid.uuid5 to derive a more standard ID format.
      """
      return ",".join([row[0], row[1]])
    
    pipe = beam.Pipeline()
    
    with_ids = (pipe
                | beam.io.ReadFromText('data.csv')
                | beam.Map(lambda x: x.split(","))
                | beam.Map(lambda x: [make_unique_id(x)] + x)
                | beam.Map(print))
    
    pipe.run()
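    To see what this produces without running a pipeline, here is `make_unique_id` applied to a hypothetical parsed row, with the uuid idea from the docstring sketched via `uuid.uuid5` (the row contents are made up for illustration):

```python
import uuid

def make_unique_id(row):
    # composite key from the first two columns
    return ",".join([row[0], row[1]])

row = ["store_1", "2021-01-01", "42"]  # hypothetical parsed CSV row
composite = make_unique_id(row)        # -> "store_1,2021-01-01"
# uuid.uuid5 derives a deterministic, standard-format ID from the same key
stable_id = str(uuid.uuid5(uuid.NAMESPACE_URL, composite))
print([composite] + row)
```

    Because `uuid.uuid5` is deterministic, every row with the same column values maps to the same ID, so it can still serve as a grouping key later.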
    
    