
I am working with Synapse Spark Pools in a controlled corporate environment. I have limited permission to query AAD, but I can create UAMIs and assign them to resources.

When I access my Synapse workspace I can create a Spark Job Definition to read some data from ADLS. Looking at the Apache Spark Applications list under the Monitor tab, I can see that these jobs use my identity ([email protected]) as the ‘Submitter’, and since I have given myself rx access to the data store they succeed.
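For reference, the rx grant is just a standard ADLS Gen2 ACL entry for my own AAD object ID; something like this (account, container and object ID are placeholders):

    az storage fs access update-recursive \
        --account-name mystorageaccount \
        --file-system mycontainer \
        --path / \
        --acl "user:<my-aad-object-id>:r-x" \
        --auth-mode login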

Now if I create a Pipeline and configure it to run my Spark Job Definition, it fails with an authorisation error. Going back to the Apache Spark Applications list under Monitor, I see that my Pipeline has a different identity as Submitter, which would explain why it is not authorised to access the data.

Firstly, I’m not sure which identity is now being used as Submitter; I don’t recognise the UUID as either my Synapse Workspace SAMI or UAMI (and I can’t query AAD for more info).

However, in general it occurs to me that I would like to be able to assign explicit UAMIs for my Pipelines to run under. Is this possible? Or is there a different model for managing this?

2 Answers


  1. Chosen as BEST ANSWER

    Slightly slow update to this, but I've arrived at something of an answer in terms of understanding, if not quite a solution. It will be useful to share here for anyone following or looking into the same questions.

    First of all, when accessing the Synapse Workspace through the Portal/UI, the actionable identity used by Notebooks or a standalone 'Apache Spark Job Definition' is the identity of the logged-in user (via 'AAD Passthrough'). This is great for user experience, especially in Notebooks; you just need to make sure that you, as the individual, have personal access to any data sources you use. In some cases, where your user identity doesn't have this access, you could make use of a Workspace Linked Service identity instead, but not always! (keep reading)
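    As a concrete illustration (account, container and path are hypothetical), this is the sort of read that succeeds in a Notebook or standalone Spark Job Definition purely because the signed-in user has rx access to the path:

        # Executed as the signed-in user via AAD passthrough in interactive
        # Synapse Spark; 'spark' is the session Synapse provides.
        df = spark.read.parquet(
            "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/some/path"
        )
        df.show()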

    Once you switch to using Pipelines, however, the identity used is the System Assigned Managed Identity (SAMI) of the workspace, which is created and assigned at resource creation. This is OK, but it is important to understand the granularity, ie. it is the Workspace that has access to resources, not individual Pipelines. Therefore if you want to run Pipelines with different levels of access, you will need to deploy them to segregated Synapse Workspaces (with distinct SAMIs). The SAMI's principal ID can be looked up as sketched below.
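    If, like me, you can't query AAD freely, you can still retrieve the SAMI's principal ID from the workspace resource itself; a minimal sketch with placeholder names, using the standard Azure CLI:

        # Prints the object (principal) ID of the workspace's SAMI
        az synapse workspace show \
            --name my-synapse-workspace \
            --resource-group my-rg \
            --query identity.principalId \
            --output tsv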

    One aside on this is the identity of the 'Submitter' that I mentioned in my original question, which is visible under the Monitor tab of the Synapse workspace for all Apache Spark applications. When running as the user (eg. Notebooks), this Submitter ID is my AAD username, which is straightforward. However, when running as a pipeline the Submitter ID is 'ee20d9e7-6295-4240-ba3f-c3784616c565', and I mean literally this same UUID for EVERYONE. It turns out this is the ID of ADF as an enterprise application. Not very useful compared to putting the Workspace SAMI in here, for example, but that's what it is, in case anyone else is drifting down that rabbit hole!

    You can create and assign an additional User Assigned Managed Identity (UAMI) to the Workspace, but it will not be used by an executing pipeline. The UAMI can be used by a Workspace Linked Service, but that has some limitations of its own (mentioned below). Also, my experience is that a UAMI assigned at workspace creation will not be correctly 'associated' with the Workspace until I manually create a second UAMI in the portal. I haven't gone deep into this, as it turns out UAMIs are no good to me, but it seems like a straightforward bug.
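    (Creating the UAMI itself is the easy part, eg. with placeholder names:

        az identity create --name my-pipeline-uami --resource-group my-rg

    It's the assignment to the Workspace, which I did through the portal, where the odd behaviour above showed up.)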

    Now, my specific use case is running Apache Spark applications in Synapse Pipelines, and the straightforward way to make this work is to make sure the Workspace SAMI has access to the required resources (eg. via a role assignment like the one sketched below), and you're good to go. If you just want to make it work then do this and stop here, but if you want to look a little deeper, carry on...
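    In practice that means granting the SAMI a data-plane role on the storage account. A minimal sketch with placeholder IDs, using the built-in reader role (swap in 'Storage Blob Data Contributor' if your jobs also write):

        az role assignment create \
            --assignee-object-id <workspace-sami-principal-id> \
            --assignee-principal-type ServicePrincipal \
            --role "Storage Blob Data Reader" \
            --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/mystorageaccount"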

    The suggestion in some of the Microsoft documentation is that you should be able to use a Workspace Linked Service within a Spark application in order to get access to resources. However, this doesn't work: I've been discussing it with Microsoft, and they have confirmed the issue and are investigating. So at this point it's worth noting the date (02/02/2023 - handily unambiguous for American readers ;-)), because the issue may later be resolved. But right now your only option in your Spark code is to fall back on the user/workspace identities.
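    For reference, the documented pattern looks like the following (all names hypothetical); this is the configuration that, as of the date above, does not authenticate correctly from inside a Spark application:

        # Point the ABFS driver at a workspace Linked Service so tokens are
        # issued for the linked service identity instead of the user/SAMI.
        account = "mystorageaccount.dfs.core.windows.net"
        spark.conf.set(f"spark.storage.synapse.{account}.linkedServiceName",
                       "MyAdlsLinkedService")
        spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}",
                       "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider")
        df = spark.read.parquet(f"abfss://mycontainer@{account}/some/path")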

    Just a thought on why this matters: it is not really for segregation, since any resource running in the Workspace can access any Linked Service. It is really more a question of identity and resource management, ie. it would be better to separate the identities being used and assigned to resources for access from the resources themselves. In most cases we'd rather do this with groups than individual identities, and if the management processes are long-winded (mine are) then I'd rather not have to repeat them every time I create a resource.

    Anyway that's enough for now, will update if this changes while I'm still paying attention...


  2. As I understand it, the ask here is how to read data from ADLS from a Spark job. Since you have access to the ADLS yourself, the job works fine when run under your identity. I think you will have to set up the same permissions for the Synapse Workspace on the ADLS and it should then work fine.
