Troubleshooting Guide

Edge Developer Toolbox Developer Guide

ID 783775

Date 06/07/2024

Version 24.05

Confidential

Starter Functionality

General

Users can schedule any workload in a batch deployment with a maximum of three edge nodes on the Edge node selection (Launch) page.
A maximum of six deployments is currently supported. Beyond that, a message states that the maximum deployment limit has been reached; the user can run the next workload after any of these six deployments reaches a completed status.
Benchmark Smart Retail Analytics/Smart City Analytics catalog on Intel hardware: Once the application status is “Running,” click Route Output to access the video feed and the Grafana* dashboard and to visualize telemetry. If the workload status stays in the “Queued” state for a long time, we recommend canceling and redeploying on other hardware.
Very rarely, users might see a “FAILED” status when doing batch deployment; in this scenario, the user can trigger the same workload on the same edge node again.
Because server-side events are enabled on the dashboard, users see real-time status changes while running workloads. Users might see a “Running” state followed by “Deploying” for a few seconds while some workload pods are added. They may also see some image pull errors, after which the workload will start running.
Terminating a Helm-based workload might take around 5-6 seconds.
File system loading can sometimes take 30 to 60 seconds.
When a new user logs in to the Edge Developer Toolbox, the system checks that the file system is up, and all pre-loaded files are copied from S3 to the user’s PVC.
On the hardware selection page, users with premium role access can choose premium hardware or additional edge nodes for certain node types. This access allows for quicker deployment of workloads, reducing wait times in some cases.
When checking logs for running/completed pods during deployments, there will be some delay (greater than 30 seconds) as the logs are fetched from a large dataset.
Each Edge Developer Toolbox session expires after five hours of login time.
When a reference implementation, such as Smart City Analytics or Crash Event Prediction, is running, users may notice that routes to Grafana or Jupyter Lab URLs are redirected to the homepage. This is due to changes in the process monitor affecting all service ports. To fix this issue, users should delete the workload from the Applications page and then re-launch it.

Optimize AI model

While importing the question-answering model (i.e., bert-large-uncased-whole-word-masking-squad-0001), the import function may get stuck at 77%. If this happens, restart the benchmark from the Dashboard and try again.
After importing a model from Hugging Face, you might encounter the error message “Unable to retrieve benchmark results” when attempting to view benchmark results. This may be due to incompatible hardware configuration or fully occupied hardware resources. Try again with a different hardware configuration.

Jupyter* notebooks for Generative AI

Opening the JupyterLab* environment might take 20 to 30 seconds.
If any session timeout-related issues are observed while opening JupyterLab, it is recommended that you clear your browser’s cookies and cache and try launching it again.

Sample Application

To build a container for Smart Retail/City Analytics applications, leverage the Image Builder Plugin in Visual Studio Code*.
Open the Smart Retail Analytics/Smart City Analytics code in the JupyterLab environment and follow the steps in the Readme file and container image. If an “Application not available” error is displayed, relaunch it from the Edge Developer Toolbox home page.
By default, the application will be in a running state for 120 minutes. The option to manually stop the workload is under My Dashboard -> Application Name.

VS Code

Relaunch the tool if an “Application not available” error is encountered when launching Visual Studio Code from the Develop AI Application section.
To clone any new Git repo using the Source control plugin, click the Close folder option from the File Menu. Once done, click on the Source control plugin and click on Clone repository.
Users can install any extension using the Extensions plugin, but these extensions are not recommended or tested.
Microsoft VS Code build logs on the Edge Developer Toolbox Application page are retained for one week. However, these logs are accessible for review only within the first five days of their retention period.

Bring your own application

For containers imported from public registries (for example, Docker, Quay, Azure, and GCP), wait until the image status is “Ready” before deploying on Intel hardware.
When importing a Helm* chart, if the user encounters the error “Cannot process URL, check if the URL is valid or private and try again,” check whether the Helm chart repo URL is valid; if the repo is private, provide the relevant credentials.
If the Helm chart status is in an “Error” state, delete the chart and try importing again.
If the Docker Compose status is in an “Error” state, delete it and try importing again.
On launching JupyterLab, if a session timeout pop-up is displayed, the user can clear the cache, relaunch JupyterLab, or continue the session by ignoring the pop-up.
Some Helm charts might not be supported, in which case the user sees an “Error” status after importing the chart. This is because Edge Developer Toolbox does not support certain resource names in Helm charts:
CLUSTER_TRIGGER_BINDING, CLUSTER_TRIGGER, CLUSTER_SERVICE, CLUSTER_SERVICE_CLASS, CLUSTER_SERVICE_PLAN, CLUSTER_ROLE, CLUSTER_ROLE_BINDINGS, HostPID, HostNetwork, HostIPC, HostPath
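
Before importing, it can help to look for these names in the rendered chart. The following is a minimal sketch, not part of Edge Developer Toolbox: it assumes the names above correspond to Kubernetes `kind` values (for example, CLUSTER_ROLE to ClusterRole) and to pod spec keys (for example, hostNetwork), compared without regard to case or underscores. The function name `find_unsupported` is hypothetical, and the scan is line-based rather than a full YAML parse, so it may miss or over-report entries.

```python
import re

# Names copied from the list of unsupported resources in this guide.
UNSUPPORTED_NAMES = [
    "CLUSTER_TRIGGER_BINDING", "CLUSTER_TRIGGER", "CLUSTER_SERVICE",
    "CLUSTER_SERVICE_CLASS", "CLUSTER_SERVICE_PLAN", "CLUSTER_ROLE",
    "CLUSTER_ROLE_BINDINGS", "HostPID", "HostNetwork", "HostIPC", "HostPath",
]


def _normalize(name):
    # Assumption: names are compared ignoring case and underscores.
    return name.replace("_", "").lower()


UNSUPPORTED = {_normalize(n) for n in UNSUPPORTED_NAMES}
# Assumption: the singular Kubernetes kind "ClusterRoleBinding" is also meant.
UNSUPPORTED.add("clusterrolebinding")

# Matches a YAML key with an optional value, e.g. "kind: ClusterRole".
KEY_VALUE = re.compile(r"^\s*(?:-\s*)?([A-Za-z_]+)\s*:\s*(\S*)")


def find_unsupported(manifest_text):
    """Return (line_number, name) pairs for names from the unsupported list."""
    hits = []
    for lineno, line in enumerate(manifest_text.splitlines(), start=1):
        match = KEY_VALUE.match(line)
        if not match:
            continue
        key, value = match.group(1), match.group(2).strip("\"'")
        if key == "kind" and _normalize(value) in UNSUPPORTED:
            hits.append((lineno, value))
        elif _normalize(key) in UNSUPPORTED:
            hits.append((lineno, key))
    return hits


if __name__ == "__main__":
    sample = "kind: Deployment\nspec:\n  hostPID: true\n"
    for lineno, name in find_unsupported(sample):
        print(f"line {lineno}: {name}")
```

One way to get input for this check is to render the chart locally with `helm template` and pass the resulting text to the function.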

Events

For a deployed workload, the user can check events by clicking the bell icon on the dashboard once the workload’s status is “Deploying” or “Running.”

Publish to Repo

When publishing a Helm chart, if the user encounters an “Error occurred. Unable to publish” error, check whether the .tgz file for the Helm chart exists in File system -> helm-tgz-files.
If the Helm chart or container/source image is in an “Error” state and the Publish to Repo option is disabled, the user can try reimporting, or check whether some resources in the Helm chart are not supported, as mentioned in the Bring your own application section.
For importing the Helm chart from the file system, the first step is to clone the Git repo with a .tgz file on the VS Code* interface and then browse to the .tgz file from the File system.
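
For a local copy of a chart archive, the .tgz check described above can be sketched as follows. This is an illustrative helper, not a Toolbox feature: the function name `chart_archive_ok` is hypothetical, and it relies only on the standard Helm packaging convention that a chart archive is a gzip-compressed tar file containing a Chart.yaml.

```python
import tarfile
from pathlib import Path


def chart_archive_ok(path):
    """Return True if `path` is a readable .tgz that contains a Chart.yaml."""
    archive_path = Path(path)
    if not archive_path.is_file():
        return False
    try:
        with tarfile.open(archive_path, "r:gz") as archive:
            # A packaged chart normally stores Chart.yaml under <chart-name>/.
            return any(Path(name).name == "Chart.yaml"
                       for name in archive.getnames())
    except (tarfile.TarError, OSError, EOFError):
        # Not a gzip tar archive, or the file is truncated or unreadable.
        return False
```

A False result suggests the file is missing, is not a valid archive, or does not look like a packaged chart, so re-packaging or re-cloning it before publishing is a reasonable next step.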

File system

When uploading a file, click on the retry icon if a network error is displayed.
The maximum file size allowed for upload is 1 GB.

When a user clicks the Download button for a file in the file system, large files (i.e., those between 500 MB and 5 GB) may take some time to change to “downloading” status or appear in the Download Progress section.
When connecting to an AWS S3 or Azure bucket, if an “Invalid Credentials” error is encountered, try again.
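
To avoid a failed upload, a file’s size can be checked locally against the 1 GB limit first. The sketch below is illustrative only. This guide does not say whether “1 GB” means 10^9 or 2^30 bytes, so the code assumes the smaller, decimal value as the conservative choice; the names `MAX_UPLOAD_BYTES` and `fits_upload_limit` are hypothetical.

```python
import os

# Assumption: treat "1 GB" as 10**9 bytes, the more conservative reading.
MAX_UPLOAD_BYTES = 10 ** 9


def fits_upload_limit(path, limit=MAX_UPLOAD_BYTES):
    """Return True if the file at `path` is no larger than `limit` bytes."""
    return os.path.getsize(path) <= limit
```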

Telemetry

Streaming telemetry is enabled for all workloads in the “Running” status.
To obtain information about GPU consumption in the Telemetry dashboard, run the container workload with “-e DEVICE=GPU” updated in the container configuration.
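
The “-e DEVICE=GPU” setting passes an environment variable named DEVICE into the container. How a workload uses it depends on the application; the following is a minimal sketch of one way application code might read it. The function name `select_device` and the “CPU” fallback are assumptions for illustration, not documented Toolbox behavior.

```python
import os


def select_device(environ=None, default="CPU"):
    """Return the target device named by the DEVICE environment variable."""
    env = os.environ if environ is None else environ
    # Fall back to the default when DEVICE is unset or empty (assumption).
    return env.get("DEVICE", default).strip().upper() or default


if __name__ == "__main__":
    print("Running inference on:", select_device())
```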

Edge Services

When trying out any Edge service, the launched service keeps running until the user logs out of Edge Developer Toolbox.
The user can only run one edge service at a given time. Trying a second edge service when another is running will terminate the first service.
Launching a second instance of the same edge service while it is already running will terminate the currently running instance and relaunch it. This can take 2 to 3 minutes.

Microservice Integration

Users can only execute one service at a time. If they attempt to launch two services simultaneously, the system will display the error message: “Only one microservice can be run per user at a given point in time. Please log out and log in to run another service.”
If the user does not log out after launching a service, the backend containers terminate automatically after 5 hours. During this period, trying to launch a deep link will prompt the error “Only one microservice can be run per user at a given point of time. Please log out and login to run another service.” To resolve this, users should log in again and then log out to terminate existing services.
Users may encounter the message “An error has occurred while deploying the microservice. Please click the button below to try again,” even with no services in the backend. This issue may arise because the Data Store Microservice and Intel® Edge Data Collection require some time for PVC creation and deletion. Clicking “Try again” can initiate a relaunch.
Before launching any microservice, ensure that pop-ups are enabled in the browser settings.

Premium Functionality

General

Under premium accounts, users will have access to premium hardware on the hardware selection page and some additional nodes with shorter wait times to deploy their workloads rapidly.

LLM Customer Support on the Edge

This sample allows users to build RAG-based LLM Chatbots for customer support use cases. Users can create a context, upload the files, and enter the prompts in the personal Q&A Assistant tab.

Uploading and embedding the context file can take time, sometimes over 10 minutes, depending on the content. If you experience long delays, try compressing the context file before uploading to optimize the process.
The application will be in a running state for 120 minutes. Users can manually stop the workload if termination is needed within 120 minutes.
After running the customer support sample, users can run a query to get tokens-per-second data along with CPU and memory usage on the Telemetry dashboard.
When the application is running, users may notice that routes to the Streamlit URL redirect to the homepage. This can occur due to a change in the process monitor, where all service ports are switched to 8080, while the old template in the user namespace uses 8443. To resolve this issue, users must delete the LLM workload from the Applications page and re-launch it.

Create AI Model workflow

After model training is completed, the validate button will be disabled for 1 to 2 minutes. Users can validate the model after this wait time.
Images and annotations in the dataset must correspond to each other. Otherwise, the user will not be able to train the model.
Users can retrain the model once the active training is completed.
Users have a session timeout of 4 hours. If issues occur while using user flows for extended durations, log in again and retry the same user flow.
In dataset selection, the “Date modified” column may occasionally display an invalid date (Jan 1, 1970).
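
Jan 1, 1970 is the Unix epoch, which is the date a timestamp value of 0 represents. A plausible explanation, though not confirmed by this guide, is that the modification time is missing or zero for the affected entries. The sketch below shows the effect and one way a display routine might guard against it; `format_modified` is a hypothetical name.

```python
from datetime import datetime, timezone


def format_modified(ts_seconds):
    """Format a Unix timestamp as YYYY-MM-DD, or "unknown" if missing or zero."""
    if not ts_seconds:
        # Assumption: 0 or None means the timestamp was never set.
        return "unknown"
    return datetime.fromtimestamp(ts_seconds, tz=timezone.utc).strftime("%Y-%m-%d")


# A timestamp of 0 corresponds to the start of Jan 1, 1970 (UTC).
EPOCH_DATE = datetime.fromtimestamp(0, tz=timezone.utc).date().isoformat()
```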