
Unleashing the Power of AWS QuickSight: Overcoming Data Source Deletion Challenges for Seamless Dashboard Operation


Introduction:

In this blog post, we will delve into a prevalent issue encountered by AWS QuickSight users: the unexpected deletion of data sources and its impact on the visuals in dependent dashboards. AWS QuickSight offers a robust and intuitive platform for data visualization, yet a deleted data source can break every dashboard built on it. We will investigate the root causes of this problem and provide effective solutions.

What is Amazon QuickSight?

Amazon QuickSight is a fast, cloud-powered BI service that makes it easy to build visualizations, perform ad hoc analysis, and quickly get business insights from your data.

It allows you to easily create and publish interactive dashboards that include machine learning insights.

 

Exploring Data Sources, Datasets, Analysis, and Dashboards in AWS QuickSight

A data source refers to the origin of the data you want to analyze, such as a database, a file, or an AWS service. To pull in data for visualization, you connect QuickSight to these data sources.

A dataset is a prepared and transformed subset of data from the data source for analysis. Preparing it involves selecting specific tables and columns, or performing data cleaning and transformation tasks.

An analysis is where you create visual representations (charts, graphs, and so on) of your data, build your visualizations, and explore the data in detail.

A dashboard is a collection of analyses and visualizations that provides an at-a-glance overview of the metrics and key performance indicators (KPIs) important to your business. It is a comprehensive, interactive interface where users can monitor and analyze their data in a consolidated view.
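
To make these relationships concrete, here is a minimal sketch (assuming boto3 credentials are configured and your_region_name is replaced with your Region) that lists the data sources, datasets, and dashboards in an account through the QuickSight API:

import boto3

aws_acct_region = 'your_region_name'

sts = boto3.client('sts', region_name=aws_acct_region)
aws_account_id = sts.get_caller_identity()['Account']
client = boto3.client('quicksight', region_name=aws_acct_region)

# Data sources: the connections QuickSight uses to reach your data
for source in client.list_data_sources(AwsAccountId=aws_account_id)['DataSources']:
    print('Data source:', source['Name'], source['DataSourceId'])

# Datasets: prepared views built on top of one or more data sources
for dataset in client.list_data_sets(AwsAccountId=aws_account_id)['DataSetSummaries']:
    print('Dataset:', dataset['Name'], dataset['DataSetId'])

# Dashboards: published collections of visuals built from analyses
for dashboard in client.list_dashboards(AwsAccountId=aws_account_id)['DashboardSummaryList']:
    print('Dashboard:', dashboard['Name'], dashboard['DashboardId'])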

Reasons for Deleting Data Sources in AWS QuickSight:

 User Error:

One of the main causes of data source deletion in AWS QuickSight is user error. Users may accidentally remove critical data sources without realizing the impact of their actions. This can lead to the breakdown of all dashboards that depend on the data source.

Multiple Data Sources with the Same Name:

Because AWS QuickSight allows multiple data sources to share the same name, it is easy to confuse them and delete the wrong one if they are not managed carefully.

Preventive Measures:

  • Understand the underlying reasons for unintentional data source deletion before deciding which controls to put in place.
  • Limit delete permissions for QuickSight users (see the sample deny policy after this list).
  • Implement clear and unique naming conventions for data sources to prevent confusion and accidental deletions.
  • Ensure users undergo thorough training on data management.
  • Educate users on the possible repercussions of deleting data sources.
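
For example, a deny policy sketch like the following, attached to the IAM identities of users who should not remove shared resources, blocks the destructive actions outright (the action list and the broad resource are assumptions to adjust to your environment):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": [
                "quicksight:DeleteDataSource",
                "quicksight:DeleteDataSet"
            ],
            "Resource": "*"
        }
    ]
}

An explicit deny overrides any allow granted elsewhere, so this keeps deletions limited to administrators.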

Challenges Encountered:

The deletion of data sources in AWS QuickSight can lead to loss of crucial information and disruptions in data visualization processes. Users may experience difficulties in reconstructing deleted data sources, affecting the overall efficiency and accuracy of their analytics. It is essential to address these challenges promptly to maintain the integrity of data visualization projects.

Prerequisites:

You must have the following prerequisites:

  • An AWS account
  • An Amazon QuickSight Enterprise edition account
  • AWS Identity and Access Management (IAM) permissions to manage QuickSight resources
  • AWS Lambda, used to run the scripts in this post

Solution Overview:

The recovery workflow involves four steps:

  • Detecting deleted data sources
  • Identifying associated datasets
  • Restoring deleted data sources
  • Updating associated datasets

IAM Policy for Recovering Deleted Data Sources

When working with Amazon QuickSight, it’s essential to have the appropriate permissions to manage and recover datasets effectively. Below is an IAM policy that grants the permissions needed to create, list, update, and describe dashboards, analyses, datasets, and data sources in Amazon QuickSight.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "quicksight:CreateDataSource",
                "quicksight:ListDataSets",
                "quicksight:ListDataSources",
                "quicksight:ListAnalyses",
                "quicksight:UpdateDashboard",
                "quicksight:UpdateAnalysis",
                "quicksight:UpdateDataSet",
                "quicksight:UpdateDataSource",
                "quicksight:DescribeDashboard",
                "quicksight:DescribeAnalysis",
                "quicksight:DescribeDataSet",
                "quicksight:DescribeDataSource"
            ],
            "Resource": [
                "arn:aws:quicksight:*:*:dashboard/*",
                "arn:aws:quicksight:*:*:analysis/*",
                "arn:aws:quicksight:*:*:dataset/*",
                "arn:aws:quicksight:*:*:datasource/*"
            ]
        }
    ]
}

Detecting Deleted Data Sources:

There are two ways to obtain the ID of a deleted data source:

  • We can query the API activity logs (CloudTrail, or CloudTrail events delivered to CloudWatch Logs) and extract the deleted data source ID from the DeleteDataSource event, as in the sketch after this list.
  • If you don't have those logs enabled, you can use the script further below to list the remaining data sources and their IDs, then compare the list against your records to spot the missing one.
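
If CloudTrail is available (its event history keeps 90 days of management API calls), a minimal sketch like the following can surface recent DeleteDataSource calls; the exact requestParameters field name (dataSourceId) is an assumption based on how CloudTrail records API request parameters:

import json
import boto3

aws_acct_region = 'your_region_name'

cloudtrail = boto3.client('cloudtrail', region_name=aws_acct_region)

# Look up recent DeleteDataSource management events
events = cloudtrail.lookup_events(
    LookupAttributes=[{'AttributeKey': 'EventName', 'AttributeValue': 'DeleteDataSource'}],
    MaxResults=50
)

for event in events['Events']:
    detail = json.loads(event['CloudTrailEvent'])
    request_parameters = detail.get('requestParameters') or {}
    print(event['EventTime'], request_parameters.get('dataSourceId'),
          'deleted by', detail.get('userIdentity', {}).get('arn'))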

The following Python script retrieves a list of all data sources and their respective IDs.

import boto3

aws_acct_region = 'your_region_name'

def lambda_handler(event, context):
    sts = boto3.client('sts', region_name=aws_acct_region)
    aws_account_id = sts.get_caller_identity()['Account']
    client = boto3.client('quicksight', region_name=aws_acct_region)

    # List all data sources in the account, paginating with NextToken if needed
    data_sources = []
    next_token = None
    while True:
        if next_token:
            response = client.list_data_sources(AwsAccountId=aws_account_id, NextToken=next_token)
        else:
            response = client.list_data_sources(AwsAccountId=aws_account_id)
        data_sources.extend(response['DataSources'])
        next_token = response.get('NextToken')
        if not next_token:
            break

    # Prepare output rows: a header followed by one row per data source
    data_source_details = [['Name', 'DataSourceId']]
    for data_source in data_sources:
        data_source_details.append([data_source['Name'], data_source['DataSourceId']])

    # Create CSV data
    csv_data = '\n'.join([','.join(map(str, row)) for row in data_source_details])

    # Optional: upload the CSV to S3 (replace bucket and key with your details)
    # bucket_name = 'your_bucket_name'
    # s3_key = 'file_name.csv'
    # s3_client = boto3.client('s3', region_name=aws_acct_region)
    # s3_client.put_object(Body=csv_data, Bucket=bucket_name, Key=s3_key)

    # Return the CSV so the list can be read from the invocation result
    return {
        'statusCode': 200,
        'body': csv_data
    }

Explanation:

  • Using the sts client, the script retrieves the AWS account ID.
  • It uses the Boto3 client to connect to the QuickSight service.
  • The list_data_sources call retrieves all data sources for the specified account, following NextToken to paginate through larger accounts.
  • The script iterates through each data source and extracts its name and ID.
  • It builds a header row (Name, DataSourceId) followed by one data row per source.
  • The script converts the rows into CSV format using a list comprehension and returns the CSV in the response body (optionally uploading it to S3).

Identifying Associated Datasets:

Based on the data source ID we retrieved earlier, we can now identify all datasets created using that specific data source. This process involves querying the AWS QuickSight API to gather information about datasets associated with the given data source. We can identify the datasets linked to the specified data source by iterating through the list.

The following Python script identifies every dataset that was built from the specified data source and writes the list to Amazon S3.

import boto3
aws_acct_region='your_region_name'

def lambda_handler(event, context):
    sts = boto3.client('sts',region_name=aws_acct_region)
    aws_account_id = str(sts.get_caller_identity()['Account'])
    data_source_id='your_data_source_id'
    data_source_arn = "arn:aws:quicksight:{}:{}:datasource/{}".format(aws_acct_region,aws_account_id, data_source_id)
    client = boto3.client('quicksight',region_name=aws_acct_region)

    # Get the list of datasets
    datasets = []
    next_token = None
    while True:
        if next_token:
            response = client.list_data_sets(AwsAccountId=aws_account_id, NextToken=next_token)
        else:
            response = client.list_data_sets(AwsAccountId=aws_account_id)
        datasets.extend(response['DataSetSummaries'])
        next_token = response.get('NextToken')
        if not next_token:
            break

    # Extract dataset names and IDs
    data_source_details = []
    for dataset in datasets:
        try:
            dataset_response = client.describe_data_set(AwsAccountId=aws_account_id, DataSetId=dataset['DataSetId'])
        except client.exceptions.InvalidParameterValueException as e:
            print(f"Error describing dataset {dataset['DataSetId']}: {e}")
            continue

        # A dataset's physical tables can reference the data source through
        # CustomSql, RelationalTable, or S3Source definitions, so collect every
        # DataSourceArn that appears in the physical table map
        dataset_source_arns = set()
        for physical_table_details in dataset_response['DataSet']['PhysicalTableMap'].values():
            for table_definition in physical_table_details.values():
                if isinstance(table_definition, dict) and table_definition.get('DataSourceArn'):
                    dataset_source_arns.add(table_definition['DataSourceArn'])

        if data_source_arn in dataset_source_arns:
            data_source_details.append([dataset['Name'], dataset['DataSetId']])

    # Write dataset names and IDs to a CSV string
    csv_data = '\n'.join([','.join(map(str, row)) for row in data_source_details])

    # Upload the CSV data to S3
    bucket_name = 'your_bucket_name'
    s3_key = 'file_name.csv'
    s3_client = boto3.client('s3',region_name=aws_acct_region)
    s3_client.put_object(Body=csv_data, Bucket=bucket_name, Key=s3_key)

    return {
        'statusCode': 200,
        'body': 'Dataset list uploaded to S3 successfully'
    }

Explanation:

  • Using the sts client, the script retrieves the AWS account ID and builds the ARN of the deleted data source.
  • It uses the Boto3 client to connect to the QuickSight service.
  • List Datasets: The code uses the list_data_sets API to retrieve every dataset in the AWS account, following NextToken to paginate when there are more datasets than a single call can return.
  • Check Dataset Dependencies: For each dataset, the code calls the describe_data_set API to obtain its physical table map, then checks whether any physical table references the data source of interest. Datasets that depend on it are added to data_source_details along with their names and IDs.

Restoring Deleted Data Sources:

In this guide, we’ll delve into the essential steps required to recreate a data source in AWS QuickSight. This comprehensive process will help you restore connectivity and functionality to your datasets and analyses, ensuring minimal disruption to your operations.

Here’s a step-by-step approach to recreating a deleted data source:

  • Gather Information: Before diving in, gather some key details about the deleted data source.
  • Data Source ID: This unique identifier helps reference the specific data source.
  • Data Source Type: Identify the type of data source (e.g., MySQL, PostgreSQL).
  • Connection Details: This includes information such as the database name, username, and password (keep these secure, for example in AWS Secrets Manager, as sketched after this list).
  • Security Configuration: Note down any VPC connection details or SSL settings used by the original data source.
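
As a side note on securing the connection details, the restore script below hard-codes the database credentials for simplicity; a minimal sketch like the following pulls them from AWS Secrets Manager instead (the secret name your_db_secret_name and its JSON keys username and password are placeholders to adapt to your own secret):

import json
import boto3

aws_acct_region = 'your_region_name'

def get_db_credentials(secret_name='your_db_secret_name'):
    # Fetch the database username and password from AWS Secrets Manager
    secrets_client = boto3.client('secretsmanager', region_name=aws_acct_region)
    secret = json.loads(secrets_client.get_secret_value(SecretId=secret_name)['SecretString'])
    # Return the structure expected by create_data_source's Credentials parameter
    return {
        'CredentialPair': {
            'Username': secret['username'],
            'Password': secret['password']
        }
    }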

The following Python script recreates the deleted data source:

import boto3
aws_acct_region='your_region_name'
def lambda_handler(event, context):
    sts = boto3.client('sts',region_name=aws_acct_region)
    client = boto3.client('quicksight',region_name=aws_acct_region)
    aws_account_id = sts.get_caller_identity()['Account']
    # Specify the desired data source ID
    data_source_id = 'deleted_datasource_id'
    # Define the data source parameters
    data_source_parameters = {
        'RdsParameters': {
            'InstanceId': 'your_rds_instance_id',
            'Database': 'your_database_name'
        }
    }
    data_source_name = 'data_source_name'
    data_source_type = 'MYSQL'
    vpc_connection_name ='your_vpc_connection_name'
    vpc_connection_arn = 'arn:aws:quicksight:{}:{}:vpcConnection/{}'.format(aws_acct_region,aws_account_id, vpc_connection_name)
    disable_ssl = False
    # Define the data source credentials
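    # Note: in production, avoid hard-coding credentials; consider retrieving them from AWS Secrets Manager (see the sketch above)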
    data_source_credentials = {
        'CredentialPair': {
            'Username': 'give_your_username',
            'Password': 'password'
        }
    }
    # Create the data source
    response = client.create_data_source(
        AwsAccountId=aws_account_id,
        DataSourceId=data_source_id,
        Name=data_source_name,
        Type=data_source_type,
        DataSourceParameters=data_source_parameters,
        Credentials=data_source_credentials,
        SslProperties={
            'DisableSsl': disable_ssl
        },
        VpcConnectionProperties={
            'VpcConnectionArn': vpc_connection_arn
        }
    )
    return response

 

Explanation:

The provided Python code snippet utilizing the Boto3 library demonstrates how to programmatically recreate a data source using the AWS SDK. This approach is helpful for automating the recreation process, especially if you’re dealing with multiple data sources.

Key Points in the Code:

  • The sts client is used to retrieve your AWS account ID.
  • It defines the data source parameters specific to your database type (an RDS MySQL instance in this example).
  • It specifies credentials and security configurations, such as VPC connections and SSL settings.
  • The create_data_source function creates a new data source with the provided details (a quick verification sketch follows this list).
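
After the script runs, it is worth confirming that the recreation succeeded before updating any datasets. A minimal check, reusing the same account and data source ID placeholders as above, might look like this:

import boto3

aws_acct_region = 'your_region_name'

sts = boto3.client('sts', region_name=aws_acct_region)
aws_account_id = sts.get_caller_identity()['Account']
client = boto3.client('quicksight', region_name=aws_acct_region)

# Confirm the recreated data source exists and check its creation status
response = client.describe_data_source(
    AwsAccountId=aws_account_id,
    DataSourceId='deleted_datasource_id'
)
print(response['DataSource']['Name'], response['DataSource']['Status'])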

Updating Associated Datasets:

After recreating the data source, you must update all datasets that depended on the original source. Point them to the recreated data source to establish connectivity and ensure that your analysis and dashboards continue to function properly.

Backup the Dataset Configuration:

It’s critical to back up your datasets before making any updates. You can capture each dataset’s configuration, including its physical and logical table maps and import mode, and store it outside QuickSight; the script below writes these settings to a CSV file in Amazon S3. This backup lets you recreate a dataset in its previous state and easily revert if any issues occur during the update process.

The following Python script backs up the dataset configurations to Amazon S3:

import json
import csv
import boto3

aws_acct_region = 'your_region_name'
client = boto3.client('quicksight', region_name=aws_acct_region)
s3_client = boto3.client('s3', region_name=aws_acct_region)
s3_bucket_name = 'your_bucket_name'

def dump_dataset(account_id):
    # Collect every dataset summary in the account, paginating with NextToken
    datasets = []
    response = client.list_data_sets(AwsAccountId=account_id)
    datasets += response['DataSetSummaries']
    next_token = response.get('NextToken')
    while next_token:
        response = client.list_data_sets(AwsAccountId=account_id, NextToken=next_token)
        datasets += response['DataSetSummaries']
        next_token = response.get('NextToken')
    dataset_ids = [summary['DataSetId'] for summary in datasets]

    # Write each dataset's configuration to a CSV file
    column_names = ['id', 'name', 'physical_table', 'logical_table', 'mode']
    with open('/tmp/dataset.csv', 'w', newline='') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=column_names)
        writer.writeheader()  # Write column names as the header
        for dataset_id in dataset_ids:
            try:
                res = client.describe_data_set(
                    AwsAccountId=account_id,
                    DataSetId=dataset_id
                )
                writer.writerow({
                    'id': res['DataSet']['DataSetId'],
                    'name': res['DataSet']['Name'],
                    'physical_table': res['DataSet']['PhysicalTableMap'],
                    'logical_table': res['DataSet']['LogicalTableMap'],
                    'mode': res['DataSet']['ImportMode']
                })
            except Exception as e:
                # Some datasets (for example, uploaded files) cannot be described through the API
                print(e)
                continue

    # Upload the CSV file to S3
    file_name = 'your_file_name.csv'
    s3_client.upload_file('/tmp/dataset.csv', s3_bucket_name, file_name)

def lambda_handler(event, context):
    sts = boto3.client('sts', region_name=aws_acct_region)
    aws_account_id = str(sts.get_caller_identity()['Account'])
    dump_dataset(aws_account_id)

    return {
        'statusCode': 200,
        'body': json.dumps('Successfully backed up the dataset configurations')
    }

Explanation:

You can back up your dataset configurations from AWS QuickSight to an S3 bucket using the code snippet above, ensuring you have a copy of your datasets in case you need to restore them in the future.

Modify the dataset configuration:

Retrieve Dataset Information:

Use the describe_data_set API to retrieve detailed information about each dataset, including its ID, name, and current configuration.

 Modify Dataset Configuration:

To update the configuration as needed, edit the retrieved dataset information. This may include changing field mappings, data types, or any other dataset settings.

To remove fields that the update_data_set API does not accept, edit the response:

  • Remove the top-level "Status" and "RequestId" fields, and unwrap the contents of the "DataSet" object.
  • Remove the "Arn", "CreatedTime", "LastUpdatedTime", and "ConsumedSpiceCapacityInBytes" fields.
  • Remove the "OutputColumns" block.

Update Dataset Configuration:

To apply the modified configuration to the dataset, use the update_data_set API.

Here is a Python script that shows you how to modify the dataset configuration:

import boto3
aws_acct_region='your_region_name'
def update_dataset(dataset_id):
    # Define the account ID
    account_id = "your_account_id"
    client = boto3.client("quicksight",region_name=aws_acct_region)
    try:
        response = client.describe_data_set(
            AwsAccountId=account_id,
            DataSetId=dataset_id
        )
        # Edit the dataset description
        dataset_description = response["DataSet"]
        # Remove unnecessary keys from the dataset description
        dataset_description.pop("Arn", None)
        dataset_description.pop("CreatedTime", None)
        dataset_description.pop("LastUpdatedTime", None)
        dataset_description.pop("ConsumedSpiceCapacityInBytes", None)
        dataset_description.pop("RequestId", None)
        dataset_description.pop("OutputColumns", None)
        dataset_description.pop("RowLevelPermissionDataSet", None)
        # Update the dataset with the modified description
        update_params = {
            "AwsAccountId": account_id,
            "DataSetId": dataset_id,
            **dataset_description  # Include all remaining keys from the dataset description
        }
        response = client.update_data_set(**update_params)
        print(f"Dataset {dataset_id} updated successfully!")
    except Exception as e:
        print(f"Error updating dataset {dataset_id}: {str(e)}")
def lambda_handler(event, context):
    # Define the list of dataset IDs that need to be re-pointed at the recreated data source
    dataset_ids = ['your_dataset_id_1', 'your_dataset_id_2']
    for dataset_id in dataset_ids:
        update_dataset(dataset_id)

 

Conclusion:

Ensuring data integrity and resilience in AWS QuickSight requires a proactive approach that combines preventive measures, systematic identification of deleted data sources, and automated processes for recreating and updating datasets. By implementing stringent permissions, clear naming conventions, and comprehensive user training, organizations can reduce the risk of accidental data source deletions. Regularly backing up dataset configurations to S3 and using scripts to update configurations after recreation ensure data continuity, strengthen data governance, and minimize disruptions to data visualization and analytics workflows in AWS QuickSight.


Siva Nagi Reddy Munagala
