In today's world, the reliability of data solutions is everything. When we build dashboards and reports, one expects that the numbers reflected there are correct and up-to-date. Based on these numbers, insights are drawn and actions are taken. If, for any unforeseen reason, the dashboards are broken or the numbers are incorrect, then it becomes a fire-fight to fix everything. If the issues are not fixed in time, it damages the trust placed in the data team and their solutions.
But why would dashboards be broken or have wrong numbers? If the dashboard was built correctly the first time, then 99% of the time the issue comes from the data that feeds the dashboards: the data warehouse. Some possible scenarios are:
- A few ETL pipelines failed, so the new data is not yet in
- A table is replaced with another new one
- Some columns in the table are dropped or renamed
- Schemas in the data warehouse have changed
- And many more.
There is still a chance that the issue is on the Tableau site, but in my experience, most of the time it is due to some changes in the data warehouse. Even though we know the root cause, it's not always easy to start working on a fix. There is no central place where you can check which Tableau data sources rely on specific tables. If you have the Tableau Data Management add-on, it could help, but from what I know, it's hard to find dependencies of custom SQL queries used in data sources.
Besides, the add-on is too expensive and most companies don't have it. The real pain begins when you have to go through all the data sources manually to start fixing it. On top of it, you have a string of users impatiently waiting for a quick fix. The fix itself might not be difficult; it may just be time-consuming.
What if we could anticipate these issues and identify impacted data sources before anyone notices a problem? Wouldn't that just be great? Well, there is a way now, with the Tableau Metadata API. The Metadata API uses GraphQL, a query language for APIs that returns only the data that you're interested in. For more info on what's possible with GraphQL, do check out GraphQL.org.
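To give you a quick feel for it before we dive in, a GraphQL request describes exactly the fields you want, and the response mirrors that shape. A stripped-down version of the query we will run later looks like this:

{
  publishedDatasources {
    name
  }
}

This would return just the name of every published data source on the site and nothing else, which makes the results predictable to parse.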
In this blog post, I'll show you how to connect to the Tableau Metadata API using Python's Tableau Server Client (TSC) library to proactively identify data sources using specific tables, so that you can act fast before any issues arise. Once you know which Tableau data sources are affected by a specific table, you can make some updates yourself or alert the owners of those data sources about the upcoming changes so that they can be prepared for them.
Connecting to the Tableau Metadata API
Let's connect to the Tableau Server using TSC. We need to import all the libraries we will need for this exercise!
### Import all required libraries
import tableauserverclient as t
import pandas as pd
import json
import ast
import re
In order to connect to the Metadata API, you will first have to create a personal access token in your Tableau account settings. Then update <API_TOKEN_NAME> and <TOKEN_KEY> with the token you just created. Also update <YOUR_SITE> with your Tableau site name. If the connection is established successfully, then "Connected" will be printed in the output window.
### Connect to Tableau server using personal access token
tableau_auth = t.PersonalAccessTokenAuth("<API_TOKEN_NAME>", "<TOKEN_KEY>",
                                         site_id="<YOUR_SITE>")
server = t.Server("https://dub01.online.tableau.com/", use_server_version=True)

with server.auth.sign_in(tableau_auth):
    print("Connected")
Let's now get a list of all data sources that are published on your site. There are many attributes you can fetch, but for the current use case, let's keep it simple and only get the id, name and owner contact information for every data source. This will be our master list to which we will add all other information.
############### Get all the list of data sources on your Site
all_datasources_query = """ {
  publishedDatasources {
    name
    id
    owner {
      name
      email
    }
  }
}"""

with server.auth.sign_in(tableau_auth):
    result = server.metadata.query(
        all_datasources_query
    )
Since I want this blog to be focused on how to proactively identify which data sources are affected by a specific table, I will not be going into the nuances of the Metadata API. To better understand how the query works, you can refer to Tableau's own very detailed Metadata API documentation.
One thing to note is that the Metadata API returns data in a JSON format. Depending on what you're querying, you'll end up with multiple nested json lists and it can get very tricky to convert this into a pandas dataframe. For the above metadata query, you will end up with a result that looks like the one below (this is mock data, just to give you an idea of what the output looks like):
{
  "data": {
    "publishedDatasources": [
      {
        "name": "Sales Performance DataSource",
        "id": "f3b1a2c4-1234-5678-9abc-1234567890ab",
        "owner": {
          "name": "Alice Johnson",
          "email": "[email protected]"
        }
      },
      {
        "name": "Customer Orders DataSource",
        "id": "a4d2b3c5-2345-6789-abcd-2345678901bc",
        "owner": {
          "name": "Bob Smith",
          "email": "[email protected]"
        }
      },
      {
        "name": "Product Returns and Profitability",
        "id": "c5e3d4f6-3456-789a-bcde-3456789012cd",
        "owner": {
          "name": "Alice Johnson",
          "email": "[email protected]"
        }
      },
      {
        "name": "Customer Segmentation Analysis",
        "id": "d6f4e5a7-4567-89ab-cdef-4567890123de",
        "owner": {
          "name": "Charlie Lee",
          "email": "[email protected]"
        }
      },
      {
        "name": "Regional Sales Trends (Custom SQL)",
        "id": "e7a5f6b8-5678-9abc-def0-5678901234ef",
        "owner": {
          "name": "Bob Smith",
          "email": "[email protected]"
        }
      }
    ]
  }
}
We need to convert this JSON response into a dataframe so that it's easy to work with. Notice that we need to extract the name and email of the owner from inside the owner object.
### We need to convert the response into dataframe for easy data manipulation
col_names = result['data']['publishedDatasources'][0].keys()
master_df = pd.DataFrame(columns=col_names)

for i in result['data']['publishedDatasources']:
    tmp_dt = {k: v for k, v in i.items()}
    master_df = pd.concat([master_df, pd.DataFrame.from_dict(tmp_dt, orient='index').T])

# Extract the owner name and email from the owner object
master_df['owner_name'] = master_df['owner'].apply(lambda x: x.get('name') if isinstance(x, dict) else None)
master_df['owner_email'] = master_df['owner'].apply(lambda x: x.get('email') if isinstance(x, dict) else None)

master_df.reset_index(inplace=True)
master_df.drop(['index', 'owner'], axis=1, inplace=True)
print('There are ', master_df.shape[0], ' datasources in your site')
This is how the structure of master_df would look:
[Image: preview of the master_df structure]
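As an aside, if you would rather avoid the row-by-row concat pattern above, pandas can flatten the same nested response in a single call. A minimal equivalent sketch, assuming the same result object:

### Alternative: flatten the nested JSON in one call
master_df = pd.json_normalize(result['data']['publishedDatasources'])
# json_normalize expands the nested owner object into dotted columns
master_df = master_df.rename(columns={'owner.name': 'owner_name', 'owner.email': 'owner_email'})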
Once we have the master list ready, we can go ahead and start getting the names of the tables embedded in the data sources. If you are an avid Tableau user, you know that there are two ways of selecting tables in a Tableau data source: one is to directly choose the tables and establish a relationship between them, and the other is to use a custom SQL query with one or more tables to achieve a new resultant table. Therefore, we need to address both cases.
Processing of Custom SQL query tables
Below is the query to get the list of all custom SQLs used on the site along with their data sources. Notice that I have filtered the list to get only the first 500 custom SQL queries. In case there are more in your org, you will have to use an offset to get the next set of custom SQL queries. There is also the option of using the cursor method in Pagination when you want to fetch a large list of results (refer here). For the sake of simplicity, I just use the offset method, as I know there are fewer than 500 custom SQL queries used on the site.
# Get the data sources and the table names from all the custom sql queries used on your Site
custom_table_query = """ {
  customSQLTablesConnection(first: 500){
    nodes {
        id
        name
        downstreamDatasources {
          name
          }
        query
    }
  }
}
"""

with server.auth.sign_in(tableau_auth):
    custom_table_query_result = server.metadata.query(
        custom_table_query
    )
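In case your site does have more than 500 of them, a cursor-based loop could look roughly like the sketch below. This assumes the connection exposes the pageInfo { hasNextPage, endCursor } fields described in Tableau's Metadata API pagination documentation:

### Sketch: cursor-based pagination, for sites with over 500 custom sql queries
all_nodes = []
cursor = None
with server.auth.sign_in(tableau_auth):
    while True:
        # Add the "after" argument only once we have a cursor from the previous page
        after_clause = ', after: "' + cursor + '"' if cursor else ''
        paginated_query = """ {
          customSQLTablesConnection(first: 500""" + after_clause + """){
            nodes {
                id
                name
                downstreamDatasources {
                  name
                }
                query
            }
            pageInfo {
              hasNextPage
              endCursor
            }
          }
        }"""
        page = server.metadata.query(paginated_query)['data']['customSQLTablesConnection']
        all_nodes.extend(page['nodes'])
        if not page['pageInfo']['hasNextPage']:
            break
        cursor = page['pageInfo']['endCursor']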
Based on our mock data, this is how our output would look:
{
  "data": {
    "customSQLTablesConnection": {
      "nodes": [
        {
          "id": "csql-1234",
          "name": "RegionalSales_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Regional Sales Trends (Custom SQL)"
            }
          ],
          "query": "SELECT r.region_name, SUM(s.sales_amount) AS total_sales FROM ecommerce.sales_data.Sales s JOIN ecommerce.sales_data.Regions r ON s.region_id = r.region_id GROUP BY r.region_name"
        },
        {
          "id": "csql-5678",
          "name": "ProfitabilityAnalysis_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Product Returns and Profitability"
            }
          ],
          "query": "SELECT p.product_category, SUM(s.revenue) AS total_profit FROM ecommerce.sales_data.Sales s JOIN ecommerce.sales_data.Products p ON s.product_id = p.product_id GROUP BY p.product_category"
        },
        {
          "id": "csql-9101",
          "name": "CustomerSegmentation_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Customer Segmentation Analysis"
            }
          ],
          "query": "SELECT c.customer_id, c.location, COUNT(o.order_id) AS total_orders FROM ecommerce.sales_data.Customers c JOIN ecommerce.sales_data.Orders o ON c.customer_id = o.customer_id GROUP BY c.customer_id, c.location"
        },
        {
          "id": "csql-3141",
          "name": "CustomerOrders_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Customer Orders DataSource"
            }
          ],
          "query": "SELECT o.order_id, o.customer_id, o.order_date, o.sales_amount FROM ecommerce.sales_data.Orders o WHERE o.order_status = 'Completed'"
        },
        {
          "id": "csql-3142",
          "name": "CustomerProfiles_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Customer Orders DataSource"
            }
          ],
          "query": "SELECT c.customer_id, c.customer_name, c.segment, c.location FROM ecommerce.sales_data.Customers c WHERE c.active_flag = 1"
        },
        {
          "id": "csql-3143",
          "name": "CustomerReturns_CustomSQL",
          "downstreamDatasources": [
            {
              "name": "Customer Orders DataSource"
            }
          ],
          "query": "SELECT r.return_id, r.order_id, r.return_reason FROM ecommerce.sales_data.Returns r"
        }
      ]
    }
  }
}
Just like before, when we were creating the master list of data sources, here too we have nested json for the downstream data sources, where we need to extract only the "name" part of it. In the "query" column, the entire custom sql is dumped. If we use a regex pattern, we can easily search for the names of the tables used in the query.
We know that the table names always come after a FROM or a JOIN clause and they usually follow the format <database_name>.<schema>.<table_name>. The <database_name> is optional and most of the time not used. There were some queries I found that did not follow this format, and I ended up getting only the database and schema names, and not the complete table name. Once we have extracted the names of the data sources and the names of the tables, we need to merge the rows per data source, as there can be multiple custom sql queries used in a single data source.
### Convert the custom sql response into dataframe
col_names = custom_table_query_result['data']['customSQLTablesConnection']['nodes'][0].keys()
cs_df = pd.DataFrame(columns=col_names)

for i in custom_table_query_result['data']['customSQLTablesConnection']['nodes']:
    tmp_dt = {k: v for k, v in i.items()}
    cs_df = pd.concat([cs_df, pd.DataFrame.from_dict(tmp_dt, orient='index').T])

# Extract the data source name where the custom sql query was used
cs_df['data_source'] = cs_df.downstreamDatasources.apply(lambda x: x[0]['name'] if x and 'name' in x[0] else None)
cs_df.reset_index(inplace=True)
cs_df.drop(['index', 'downstreamDatasources'], axis=1, inplace=True)

### We need to extract the table names from the sql query. We know the table name comes after a FROM or JOIN clause
# Note that the name of the table can be of the format <data_warehouse>.<schema>.<table_name>
# Depending on the format of how the table is called, you will have to modify the regex expression
def extract_tables(sql):
    # Regex to match database.schema.table or schema.table, avoid alias
    pattern = r'(?:FROM|JOIN)\s+((?:\[\w+\]|\w+)\.(?:\[\w+\]|\w+)(?:\.(?:\[\w+\]|\w+))?)\b'
    matches = re.findall(pattern, sql, re.IGNORECASE)
    return list(set(matches))  # Unique table names

cs_df['customSQLTables'] = cs_df['query'].apply(extract_tables)
cs_df = cs_df[['data_source', 'customSQLTables']]

# We need to merge datasources as there can be multiple custom sqls used in the same data source
cs_df = cs_df.groupby('data_source', as_index=False).agg({
    'customSQLTables': lambda x: list(set(item for sublist in x for item in sublist))  # Flatten & make unique
})

print('There are ', cs_df.shape[0], 'datasources with custom sqls used in it')
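As a quick sanity check, we can run the helper on the first mock query from earlier; it should pull out both tables referenced in the FROM and JOIN clauses:

# Quick sanity check of extract_tables() on one of the mock queries
sample_sql = ("SELECT r.region_name, SUM(s.sales_amount) AS total_sales "
              "FROM ecommerce.sales_data.Sales s "
              "JOIN ecommerce.sales_data.Regions r ON s.region_id = r.region_id "
              "GROUP BY r.region_name")
print(extract_tables(sample_sql))
# Expected (order may vary): ['ecommerce.sales_data.Sales', 'ecommerce.sales_data.Regions']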
After we perform all of the above operations, this is how the structure of cs_df would look:
[Image: preview of the cs_df structure]
Processing of regular Tables in Data Sources
Now we need to get the list of all the regular tables used in a data source that are not a part of custom SQL. There are two ways to go about it: either use the publishedDatasources object and check for upstreamTables, or use DatabaseTable and check for upstreamDatasources. I will go with the first method because I want the results at a data source level (basically, I want some code ready to reuse when I want to check a specific data source in further detail). Here again, for the sake of simplicity, instead of going for pagination, I am looping through each data source to ensure I have everything. We get the upstreamTables nested inside the fields object, so that has to be cleaned out.
############### Get the data sources with the regular table names used in your site
### Its best to extract the tables information for every data source and then merge the results.
# Since we only get the table information nested under fields, in case there are hundreds of fields
# used in a single data source, we will hit the response limits and will not be able to retrieve all the data.
data_source_list = master_df.name.tolist()

col_names = ['name', 'id', 'extractLastUpdateTime', 'fields']
ds_df = pd.DataFrame(columns=col_names)

with server.auth.sign_in(tableau_auth):
    for ds_name in data_source_list:
        query = """ {
            publishedDatasources (filter: { name: \"""" + ds_name + """\" }) {
              name
              id
              extractLastUpdateTime
              fields {
                name
                upstreamTables {
                  name
                }
              }
            }
        } """
        ds_name_result = server.metadata.query(
            query
        )
        for i in ds_name_result['data']['publishedDatasources']:
            tmp_dt = {k: v for k, v in i.items() if k != 'fields'}
            tmp_dt['fields'] = json.dumps(i['fields'])
            ds_df = pd.concat([ds_df, pd.DataFrame.from_dict(tmp_dt, orient='index').T])

ds_df.reset_index(inplace=True)
This is how the structure of ds_df would look:
[Image: preview of the raw ds_df structure]
We will need to flatten out the fields object and extract the field names as well as the table names. Since the table names will be repeating multiple times, we will have to deduplicate to keep only the unique ones.
# Function to extract the values of fields and upstream tables in json lists
def extract_values(json_list, key):
    values = []
    for item in json_list:
        values.append(item[key])
    return values

ds_df["fields"] = ds_df["fields"].apply(ast.literal_eval)
ds_df['field_names'] = ds_df.apply(lambda x: extract_values(x['fields'], 'name'), axis=1)
ds_df['upstreamTables'] = ds_df.apply(lambda x: extract_values(x['fields'], 'upstreamTables'), axis=1)

# Function to extract the unique table names
def extract_upstreamTable_values(table_list):
    values = set()
    for inner_list in table_list:
        for item in inner_list:
            if 'name' in item:
                values.add(item['name'])
    return list(values)

ds_df['upstreamTables'] = ds_df.apply(lambda x: extract_upstreamTable_values(x['upstreamTables']), axis=1)
ds_df.drop(["index", "fields"], axis=1, inplace=True)
Once we do the above operations, the final structure of ds_df would look something like this:
[Image: final structure of ds_df]
We have all the pieces and now we just need to merge them together:
###### Join all the data together
master_data = pd.merge(master_df, ds_df, how="left", on=["name", "id"])
master_data = pd.merge(master_data, cs_df, how="left", left_on="name", right_on="data_source")

# Save the results to analyse further
master_data.to_excel("Tableau Data Sources with Tables.xlsx", index=False)
This is our final master_data:
[Image: preview of the final master_data]
Table-level Impact Analysis
Let's say there were some schema changes on the "Sales" table and you want to know which data sources will be impacted. Then you can simply write a small function that checks if a table is present in either of the two columns, upstreamTables or customSQLTables, like below.
def filter_rows_with_table(df, col1, col2, target_table):
    """
    Filters rows in df where target_table is part of any value in either col1 or col2 (supports partial match).
    Returns full rows (all columns retained).
    """
    return df[
        df.apply(
            lambda row:
                (isinstance(row[col1], list) and any(target_table in item for item in row[col1])) or
                (isinstance(row[col2], list) and any(target_table in item for item in row[col2])),
            axis=1
        )
    ]

# As an example
filter_rows_with_table(master_data, 'upstreamTables', 'customSQLTables', 'Sales')
Below is the output. You can see that 3 data sources will be impacted by this change. You can also alert the data source owners Alice and Bob in advance about this, so they can start working on a fix before something breaks on the Tableau dashboards.
[Image: output showing the three impacted data sources]
You can check out the complete version of the code in my Github repository here.
This is just one of the potential use-cases of the Tableau Metadata API. You can also extract the field names used in custom sql queries and add them to the dataset to get a field-level impact analysis. One could also monitor stale data sources with the extractLastUpdateTime to see if those have any issues or need to be archived if they are not used any more. We can also use the dashboards object to fetch information at a dashboard level.
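To illustrate the stale-data-source idea, a few extra lines on top of our master_data would flag anything that has not refreshed recently. A sketch, where the 30-day threshold is an arbitrary assumption:

# Sketch: flag data sources whose extract has not refreshed in the last 30 days
master_data['extractLastUpdateTime'] = pd.to_datetime(
    master_data['extractLastUpdateTime'], errors='coerce', utc=True)
cutoff = pd.Timestamp.now(tz='UTC') - pd.Timedelta(days=30)
# Live connections have no extract timestamp and are coerced to NaT, so they drop out here
stale_ds = master_data[master_data['extractLastUpdateTime'] < cutoff]
print(stale_ds[['name', 'owner_name', 'extractLastUpdateTime']])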
Final Thoughts
If you have come this far, kudos. This is just one use case of automating Tableau data management. It's time to reflect on your own work and think about which of those other tasks you could automate to make your life easier. I hope this mini-project served as an enjoyable learning experience in understanding the power of the Tableau Metadata API. If you liked reading this, you might also like another one of my blog posts about Tableau, on some of the challenges I faced when dealing with large .
Also do check out my previous blog, where I explored building an interactive, database-powered app with Python, Streamlit, and SQLite.
Before you go…
Follow me so that you don't miss any new posts I write in the future; you will find more of my articles on my . You can also connect with me on LinkedIn or Twitter!