R News

Discover the latest R development news, repositories, and conferences at Hackertab.

Latest R articles

Statsmodels Library: An Overview

Table of Contents Introduction History When to use Installation Features Ordinary least...

12 May 2025 · devto

Fine-Tune SLMs in Colab for Free : A 4-Bit Approach with Meta Llama 3.2

Fine-tuning large language models (LLMs) sounds complex — until you meet Unsloth. Whether you’re a...

11 May 2025 · devto

DOOM...*rendered* using a single DIV and CSS! 🤯🔫💥

For clarity, I have not rebuilt DOOM in CSS...yet. No this is far simpler: rendering the output of...

10 May 2025 · devto

A Beginner’s Note on Machine Learning: Lessons from My Journey

When I first came across the term Machine Learning, I thought it was something only people working at...

10 May 2025 · devto

💹 Build a Real-Time Crypto Arbitrage Bot Using Python and Graph Theory

Have you ever spotted price differences for the same crypto across exchanges and wondered if you...

10 May 2025 · devto

How to Build a Local RAG App with Ollama and ChromaDB in the R Programming Language

A Large Language Model (LLM) is a type of machine learning model that is trained to understand and generate human-like text. These models are trained on vast datasets to capture the nuances of human language, enabling them to generate coherent and contextually relevant responses. You can enhance the performance of an LLM by providing context — structured or unstructured data, such as documents, articles, or knowledge bases — tailored to the domain or information you want the model to specialize in. Using techniques like prompt engineering and context injection, you can build an intelligent chatbot capable of navigating extensive datasets, retrieving relevant information, and delivering responses. Whether it's storing recipes, code documentation, research articles, or answering domain-specific queries, an LLM-based chatbot can adapt to your needs with customization and privacy. You can deploy it locally to create a highly specialized conversational assistant that respects your data. In this article, you will learn how to build a local Retrieval-Augmented Generation (RAG) application using Ollama and ChromaDB in R. By the end, you'll have a custom conversational assistant with a Shiny interface that efficiently retrieves information while maintaining privacy and customization. Table of Contents What is RAG? Project Overview Project Setup Ollama Installation Data Collection and Cleaning How to Create Chunks How to Generate Sentence Embeddings How to Set Up the Vector Database for Embedding Storage How to Write the User Input Query Embedding Function Tool Calling How to Initialize the Chat System, Design Prompts, and Integrate Tools How to Interact with Your Chatbot Using a Shiny App Complete Code Conclusion What is RAG? Retrieval-Augmented Generation (RAG) is a method that integrates retrieval systems with generative AI, enabling chatbots to access recent and specific information from external sources. By using a retrieval pipeline, the chatbot can fetch up-to-date, relevant data and combine it with the generative model’s language capabilities, producing responses that are both accurate and contextually enriched. This makes RAG particularly useful for applications requiring fact-based, real-time knowledge delivery. Project Overview Project Setup Prerequisites Before you begin, ensure you have installed the latest version of the items listed here: RStudio: The IDE – RStudio is the primary workspace where you'll write and test your R code. Its user-friendly interface, debugging tools, and integrated environment make it ideal for data analysis and chatbot development. R: The Programming Language – R is the backbone of your project. You'll use it to handle data manipulation, apply statistical models, and integrate your recipe chatbot components seamlessly. Python – Some libraries, like the embedding library you'll use for text vectorization, are built on Python. It’s vital to have Python installed to enable these functionalities alongside your R code. Java – Java serves as a foundational element for certain embedding libraries. It ensures efficient processing and compatibility for text embedding tasks required to train your chatbot. Docker Desktop – Docker Desktop allows you to run ChromaDB, the vector database, locally on your machine. This enables fast and reliable storage of embeddings, ensuring your chatbot retrieves relevant information quickly. Ollama – Ollama brings powerful Large Language Models (LLMs) directly to your local computer, removing the need for cloud resources. 
It lets you access multiple models, customize outputs, and integrate them into your chatbot effortlessly. Ollama Installation Ollama is an open-sourced tool you can use to run and manage LLMs on your computer. Once installed, you can access various LLMs as per your needs. You will be using llama3.2:3b-instruct-q4_K_M model to build this chatbot. A quantized model is a version of a machine learning model that has been optimized to use less memory and computational power by reducing the precision of the numbers it uses. This enables you to use an LLM locally, especially when you don’t have access to a GPU (Graphics Processing Unit – a specialized processor that perform complex computations). To start, you can download and install the Ollama software here. Then you can confirm installation by running this command: ollama --version Run the following command to start Ollama: ollama serve Next, run the following command to pull the Q4_K_M quantization of llama3.2:3b-instruct: ollama pull llama3.2:3b-instruct-q4_K_M Then confirm that the model was extracted with this: ollama list If the model extraction was successful, a list containing the model’s name, ID, and size will be returned, like so: Now you can chat with the model: ollama run llama3.2:3b-instruct-q4_K_M If successful, you should receive a prompt that you can test by asking a question and getting an answer. For example: Then you can exit the console by typing /bye or ctrl + D Data Collection and Cleaning The chatbot you are building will be a cooking assistant that suggests recipes given your available ingredients, what you want to eat, and how much food a recipe yields. You first have to get the data to train the model. You will be using a dataset that contains recipes from Kaggle. To start, load the necessary libraries: # loading required libraries library(xml2) #read, parse, and manipulate XML,HTML documents library(jsonlite) #manipulate JSON objects library(RKaggle) # download datasets from Kaggle library(dplyr) # data manipulation Then download and save recipe dataset: # Download and read the "recipe" dataset from Kaggle recipes_list <- RKaggle::get_dataset("thedevastator/better-recipes-for-a-better-life") Inspect the dataframe and extract the first element like this: # inspect the dataset class(recipes_list) str(recipes_list) head(recipes_list) # extract the first tibble recipes_df <- recipes_list[[1]] A quick inspection of the recipes_list object shows that it contains two objects of type tibble. You will be using only the first element for this project. A tibble is a type of data structure used for storing and manipulating data. It’s similar to a traditional dataframe, but it’s designed to enforce stricter rules and perform fewer automatic actions compared to traditional dataframes. We’ll use a regular dataframe in this project because more people are likely familiar with it. It can also efficiently handle row indexing, which is crucial for accessing and manipulating specific rows in our recipe dataset. In the code block below, you’ll convert the tibble to a dataframe and then drop the first column, which is the index column. Then you’ll inspect the newly converted dataframe and drop unnecessary columns. Unnecessary columns are best removed to streamline the dataset and focus on relevant features. In this project, we’ll drop certain columns that aren’t particularly useful for training the chatbot. This ensures that the model concentrates on meaningful data to improve its accuracy and functionality. 
# convert to dataframe and drop the first column recipes_df <- as.data.frame(recipes_df[, -1]) # inspect the converted dataframe head(recipes_df) class(recipes_df) colnames(recipes_df) # drop unnecessary columns cleaned_recipes_df <- subset(recipes_df, select = -c(yield,rating,url,cuisine_path,nutrition,timing,img_src)) Now you need to identify rows with NA (missing) values, which you can do like this: # Identify rows and columns with NA values which(is.na(cleaned_recipes_df), arr.ind = TRUE) # a quick inspection reveals columns [2:4] have missing values subset_column_names <- colnames(cleaned_recipes_df)[2:4] subset_column_names It is important to handle NA values to ensure that your data is complete, to prevent errors, and to preserve context. Now, replace the NA values and confirm that there are no missing values: # Replace NA values dynamically based on conditions cols_to_modify <- c("prep_time", "cook_time", "total_time") cleaned_recipes_df[cols_to_modify] <- lapply( cleaned_recipes_df[cols_to_modify], function(x, df) { # Replace NA in prep_time and cook_time where both are NA replace(x, is.na(df$prep_time) & is.na(df$cook_time), "unknown") }, df = cleaned_recipes_df # Pass the whole dataframe for conditions ) cleaned_recipes_df <- cleaned_recipes_df %>% mutate( prep_time = case_when( # If cook_time is present but prep_time is NA, replace with "no preparation required" !is.na(cook_time) & is.na(prep_time) ~ "no preparation required", # Otherwise, retain original value TRUE ~ as.character(prep_time) ), cook_time = case_when( # If prep_time is present but cook_time is NA, replace with "no cooking required" !is.na(prep_time) & is.na(cook_time) ~ "no cooking required", # Otherwise, retain original value TRUE ~ as.character(cook_time) ) ) # confirm there are no missing values any(is.na(cleaned_recipes_df)) # confirm the replacing NA logic works by inspecting specific rows cleaned_recipes_df[1081,] cleaned_recipes_df[1,] cleaned_recipes_df[405,] For this tutorial, we’ll subset the dataframe to the first 250 rows for demo purposes. This saves time when it comes to generating embeddings. # recommended for demo/learning purposes cleaned_recipes_df <- head(cleaned_recipes_df,250) How to Create Chunks To understand why chunking is important before embedding, you need to understand what an embedding is. An embedding is a vector representation of a word or a sentence. Machines don’t understand human text – they understand numbers. LLMs work by transforming human text to numerical representations in order to give answers. The process of generating embeddings requires a lot of computation, and breaking down the data to be embedded optimizes the embedding process. So now we’re going to split the dataframe into smaller chunks of a specified size to enable efficient batch processing and iteration. # Define the size of each chunk (number of rows per chunk) chunk_size <- 1 # Get the total number of rows in the dataframe n <- nrow(cleaned_recipes_df) # Create a vector of group numbers for chunking # Each group number repeats for 'chunk_size' rows # Ensure the vector matches the total number of rows r <- rep(1:ceiling(n/chunk_size), each = chunk_size)[1:n] # Split the dataframe into smaller chunks (subsets) based on the group numbers chunks <- split(cleaned_recipes_df, r) How to Generate Sentence Embeddings As previously mentioned, embeddings are vector representations of words or sentences. Embeddings can be generated from both words and sentences.
How you choose to generate embeddings depends on your intended application of the LLM. Word embeddings are numerical representations of individual words in a continuous vector space. They capture semantic relationships between words, allowing similar words to have vectors close to each other. Word embeddings can be used in search engines as they support word-level queries by matching embeddings to retrieve relevant documents. They can also be used in text classification to classify documents, emails, or tweets based on word-level features (for example, detecting spam emails or sentiment analysis). Sentence embeddings are numerical representations of entire sentences in a vector space, designed to capture the overall meaning and context of the sentence. They are used in settings where sentences provide better context like question answering systems where user queries are matched to relevant sentences or documents for more precise retrieval. For our recipe chatbot, sentence embedding is the best choice. First, create an empty dataframe that has three columns. #empty dataframe recipe_sentence_embeddings <- data.frame( recipe = character(), recipe_vec_embeddings = I(list()), recipe_id = character() ) The first column will hold the actual recipe in text form, the recipe_vec_embeddings column will hold the generated sentence embeddings, and the recipe_id holds a unique id for each recipe. This will help in indexing and retrieval from the vector database. Next, it’s helpful to define a progress bar, which you can do like this: # create a progress bar pb <- txtProgressBar(min = 1, max = length(chunks), style = 3) Embedding can take a while, so it’s important to keep track of the progress of the process. Now it’s time to generate embeddings and populate the dataframe. Write a for loop that executes the code block as long as the length of the chunks. for (i in 1:length(chunks)) {} The recipe field is the text at the chunk that is currently being executed and the unique chunk id is generated by pasting the index of the chunk and the text “chunk”. for (i in 1:length(chunks)) { recipe <- as.character(chunks[i]) recipe_id <- paste0("recipe",i) } The text embed function from the text library generates either sentence or word embeddings. It takes in a character variable or a dataframe and produces a tibble of embeddings. You can read loading instructions here for smooth running of the text library. The batch_size defines how many rows are embedded at a time from the input. Setting the keep_token_embeddings discards the embeddings for individual tokens after processing, and aggregation_from_layers_to_tokens “concatenates” or combines embeddings from specified layers to create detailed embeddings for each token. A token is the smallest unit of text that a model can process. for (i in 1:length(chunks)) { recipe <- as.character(chunks[i]) recipe_id <- paste0("recipe",i) recipe_embeddings <- textEmbed(as.character(recipe), layers = 10:11, aggregation_from_layers_to_tokens = "concatenate", aggregation_from_tokens_to_texts = "mean", keep_token_embeddings = FALSE, batch_size = 1 ) } In order to specify sentence embeddings, you need to set the argument to the aggregation_from_tokens_to_texts parameter as "mean". aggregation_from_tokens_to_texts = "mean" The "mean" operation averages the embeddings of all tokens in a sentence to generate a single vector that represents the entire sentence. This sentence-level embedding captures the overall meaning and semantics of the text, regardless of its token length. 
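To make the "mean" aggregation concrete, here is a toy sketch with made-up numbers (three 4-dimensional token embeddings, not real model output). The vectors textEmbed() actually produces are much longer, but the pooling idea is the same:

# toy illustration of mean pooling (hypothetical numbers, not real model output)
# three token embeddings, each 4-dimensional, stored as rows of a matrix
token_embeddings <- rbind(
  c(0.10, -0.30, 0.50, 0.20),   # token 1
  c(0.40,  0.10, 0.00, -0.20),  # token 2
  c(0.20,  0.20, 0.10, 0.30)    # token 3
)
# averaging across tokens yields a single 4-dimensional sentence vector
sentence_embedding <- colMeans(token_embeddings)
sentence_embedding
# [1] 0.2333333 0.0000000 0.2000000 0.1000000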
# convert tibble to vector recipe_vec_embeddings <- unlist(recipe_embeddings, use.names = FALSE) recipe_vec_embeddings <- list(recipe_vec_embeddings) The embedding function returns a tibble object. In order to obtain a vector embedding, you need to first unlist the tibble and drop the row names and then list the result to form a simple vector. # Append the current chunk's data to the dataframe recipe_sentence_embeddings <- recipe_sentence_embeddings %>% add_row( recipe = recipe, recipe_vec_embeddings = recipe_vec_embeddings, recipe_id = recipe_id ) Finally, update the empty dataframe after each iteration with the newly generated data. # track embedding progress setTxtProgressBar(pb, i) In order to keep track of the embedding progress, you can use the earlier defined progress bar inside the loop. It will update at the end of every iteration. Complete Code Block: # load required library library(text) # # ensure to read loading instructions here for smooth running of the 'text' library # # https://www.r-text.org/ # embedding data for (i in 1:length(chunks)) { recipe <- as.character(chunks[i]) recipe_id <- paste0("recipe",i) recipe_embeddings <- textEmbed(as.character(recipe), layers = 10:11, aggregation_from_layers_to_tokens = "concatenate", aggregation_from_tokens_to_texts = "mean", keep_token_embeddings = FALSE, batch_size = 1 ) # convert tibble to vector recipe_vec_embeddings <- unlist(recipe_embeddings, use.names = FALSE) recipe_vec_embeddings <- list(recipe_vec_embeddings) # Append the current chunk's data to the dataframe recipe_sentence_embeddings <- recipe_sentence_embeddings %>% add_row( recipe = recipe, recipe_vec_embeddings = recipe_vec_embeddings, recipe_id = recipe_id ) # track embedding progress setTxtProgressBar(pb, i) } How to Set Up the Vector Database for Embedding Storage A vector database is a special type of database that stores embeddings and allows you to query and retrieve relevant information. There are numerous vector databases available, but for this project, you will use ChromaDB, an open-source option that integrates with the R environment through the rchroma library. ChromaDB runs locally in a Docker container. Just make sure you have Docker installed and running on your device. Then load the rchroma library and run your ChromaDB instance: # load rchroma library library(rchroma) # run ChromaDB instance. chroma_docker_run() If it was successful, you should see this in the console: Next, connect to a local ChromaDB instance and check the connection: # Connect to a local ChromaDB instance client <- chroma_connect() # Check the connection heartbeat(client) version(client) Now you’ll need to create a collection and confirm that it was created. Collections in ChromaDB function similarly to tables in conventional databases. # Create a new collection create_collection(client, "recipes_collection") # List all collections list_collections(client) Now, add embeddings to the collection. To add embeddings to the recipes_collection, use the add_documents function. # Add documents to the collection add_documents( client, "recipes_collection", documents = recipe_sentence_embeddings$recipe, ids = recipe_sentence_embeddings$recipe_id, embeddings = recipe_sentence_embeddings$recipe_vec_embeddings ) The add_documents() function is used to add recipe data to the recipes_collection. Here's a breakdown of its arguments and how the corresponding data is accessed: documents: This argument represents the recipe text. 
It is sourced from the recipe column of the recipe_sentence_embeddings dataframe. ids: This is the unique identifier for each recipe. It is extracted from the recipe_id column of the same dataframe. embeddings: This contains the sentence embeddings, which were previously generated for each recipe. These embeddings are accessed from the recipe_vec_embeddings column of the dataframe. All three arguments—documents, ids, and embeddings—are obtained by subsetting their respective columns from the recipe_sentence_embeddings dataframe. How to Write the User Input Query Embedding Function In order to retrieve information from a vector database, you must first embed your query text. The database compares your query's embedding with its stored embeddings to find and retrieve the most relevant document. It's important to ensure that the dimensions (rows × columns) of your query embedding match those of the database embeddings. This alignment is achieved by using the same embedding model to generate your query. Matching embeddings involves calculating the similarity (for example, cosine similarity) between the query and stored embeddings, identifying the closest match for effective retrieval. Let’s write a function that allows us to embed a query which then queries similar documents using the generated embeddings. Wrapping it in a function makes it reusable. #sentence embeddings function and query question <- function(sentence){ sentence_embeddings <- textEmbed(sentence, layers = 10:11, aggregation_from_layers_to_tokens = "concatenate", aggregation_from_tokens_to_texts = "mean", keep_token_embeddings = FALSE ) # convert tibble to vector sentence_vec_embeddings <- unlist(sentence_embeddings, use.names = FALSE) sentence_vec_embeddings <- list(sentence_vec_embeddings) # Query similar documents using embeddings results <- query( client, "recipes_collection", query_embeddings = sentence_vec_embeddings , n_results = 2 ) results } This chunk of code is similar to how we have previously used the text_embed() function. The query() function is added to enable querying the vector database, particularly the recipes' collection, and returns the top two documents that closely match a user’s query. Our function thus takes in a sentence as an argument and embeds the sentence to generate sentence embeddings. It then queries the database and returns two documents that match the query most. Tool Calling To interact with Ollama in R, you will utilize the ellmer library. This library streamlines the use of large language models (LLMs) by offering an interface that enables seamless access to and interaction with a variety of LLM providers. To enhance the LLM’s usage, we need to provide context to it. You can do this by tool calling. Tool calling allows an LLM to access external resources in order to enhance its functionality. For this project, we are implementing Retrieval-Augmented Generation (RAG), which combines retrieving relevant information from a vector database and generating responses using an LLM. This approach improves the chatbot's ability to provide accurate and contextually relevant answers. Now, define a function that links to the LLM to provide context using the tool() function from the ellmer library. 
# load ellmer library library(ellmer) # function that links to llm to provide context tool_context <- tool( question, "obtains the right context for a given question", sentence = type_string() ) The tool() function takes the question function that returns the relevant documents that we’ll use as context as the first argument. We’ll use the documents to help the LLM answer questions accordingly. The text, "obtains the right context for a given question", is a description of what the tool will be doing. Finally, the sentence = type_string() defines what type of object the question() function expects. How to Initialize the Chat System, Design Prompts, and Integrate Tools Next, you’ll set up a conversational AI system by defining its role and functionality. Using system prompt design, you will shape the assistant’s behavior, tone, and focus as a culinary assistant. You’ll also integrate external tools to extend the chatbot’s capabilities by registering tools. Let’s dive in. First, you need to initialize a Chat Object: # Initialize the chat system with propmpt instructions. chat <- chat_ollama(system_prompt = "You are a knowledgeable culinary assistant specializing in recipe recommendations. You provide tailored meal suggestions based on the user's available ingredients and the desired amount of food or servings. Ensure the recipes align closely with the user's inputs and yield the expected quantity.", model = "llama3.2:3b-instruct-q4_K_M") You can do that using the chat_ollama() function. This sets up a conversational agent with the specified system prompt and model. The system prompt defines the conversational behavior, tone, and focus of the LLM while the model argument specifies the language model (llama3.2:3b-instruct-q4_K_M) that the chat system will use to generate responses. Next, you need to register a tool. #register tool chat$register_tool(tool_context) We need to tell our chat object about our tool_context() function. Do this by registering a tool using the register_tool() function. How to Interact with Your Chatbot Using a Shiny App To interact with the chatbot you’ve just created, we’ll use Shiny, a framework for building interactive web applications in R. Shiny provides a user-friendly graphical interface that allows seamless interaction with the chatbot. For this purpose, we’ll use the shinychat library, which simplifies the process of building a chat interface within a Shiny app. This involves defining two key components: User Interface (UI): Responsible for the visual layout and what the user sees. In this case, chat_ui("chat") is used to create the interactive chat interface. Server Function: Handles the functionality and logic of the application. It connects the chatbot to external tools and manages processes like embedding queries, retrieving relevant responses, and handling user inputs. 
# load the required library library(shinychat) # wrap the chat code in a Shiny App ui <- bslib::page_fluid( chat_ui("chat") ) server <- function(input, output, session) { # Connect to a local ChromaDB instance running on docker with embeddings loaded client <- chroma_connect() #sentence embeddings function and query question <- function(sentence){ sentence_embeddings <- textEmbed(sentence, layers = 10:11, aggregation_from_layers_to_tokens = "concatenate", aggregation_from_tokens_to_texts = "mean", keep_token_embeddings = FALSE ) # convert tibble to vector sentence_vec_embeddings <- unlist(sentence_embeddings, use.names = FALSE) sentence_vec_embeddings <- list(sentence_vec_embeddings) # Query similar documents using embeddings results <- query( client, "recipes_collection", query_embeddings = sentence_vec_embeddings , n_results = 2 ) results } # function that provides context tool_context <- tool( question, "obtains the right context for a given question", sentence = type_string() ) # Initialize the chat system with the first chunk chat <- chat_ollama(system_prompt = "You are a knowledgeable culinary assistant specializing in recipe recommendations. You provide tailored meal suggestions based on the user's available ingredients and the desired amount of food or servings. Ensure the recipes align closely with the user's inputs and yield the expected quantity.", model = "llama3.2:3b-instruct-q4_K_M") #register tool chat$register_tool(tool_context) observeEvent(input$chat_user_input, { stream <- chat$stream_async(input$chat_user_input) chat_append("chat", stream) }) } shinyApp(ui, server) Alright, let’s understand how this is working: User input monitoring with observeEvent(): The observeEvent() block monitors user inputs from the chat interface (input$chat_user_input). When a user sends a message, the chatbot processes it, retrieves relevant context using the embeddings, and streams the response dynamically to the chat interface. Tool calling for context: The chatbot employs tool calling to interact with external resources (like the vector database) and enhance its functionality. In this project, Retrieval-Augmented Generation (RAG) ensures the chatbot provides accurate and context-rich responses by integrating retrieval and generation seamlessly. This approach brings the chatbot to life, enabling users to interact with it dynamically through a responsive Shiny app. Complete Code The R scripts have been split in two, with data.R containing code that handles data gathering and cleaning, text chunking, sentence embeddings generation, creating a vector database, and loading documents to it. The chat.R script contains code that handles user input querying, context retrieval, chat initialization, system prompt design, tool integration, and a chat Shiny app. data.R # install and load required packages # install devtools from CRAN install.packages('devtools') devtools::install_github("benyamindsmith/RKaggle") library(text) library(rchroma) library(RKaggle) library(dplyr) # run ChromaDB instance. 
chroma_docker_run() # Connect to a local ChromaDB instance client <- chroma_connect() # Check the connection heartbeat(client) version(client) # Create a new collection create_collection(client, "recipes_collection") # List all collections list_collections(client) # Download and read the "recipe" dataset from Kaggle recipes_list <- RKaggle::get_dataset("thedevastator/better-recipes-for-a-better-life") # extract the first tibble recipes_df <- recipes_list[[1]] # convert to dataframe and drop the first column recipes_df <- as.data.frame(recipes_df[, -1]) # drop unnecessary columns cleaned_recipes_df <- subset(recipes_df, select = -c(yield,rating,url,cuisine_path,nutrition,timing,img_src)) ## Replace NA values dynamically based on conditions # Replace NA when all columns have NA values cols_to_modify <- c("prep_time", "cook_time", "total_time") cleaned_recipes_df[cols_to_modify] <- lapply( cleaned_recipes_df[cols_to_modify], function(x, df) { # Replace NA in prep_time and cook_time where both are NA replace(x, is.na(df$prep_time) & is.na(df$cook_time), "unknown") }, df = cleaned_recipes_df ) # Replace NA when either or columns have NA values cleaned_recipes_df <- cleaned_recipes_df %>% mutate( prep_time = case_when( # If cook_time is present but prep_time is NA, replace with "no preparation required" !is.na(cook_time) & is.na(prep_time) ~ "no preparation required", # Otherwise, retain original value TRUE ~ as.character(prep_time) ), cook_time = case_when( # If prep_time is present but cook_time is NA, replace with "no cooking required" !is.na(prep_time) & is.na(cook_time) ~ "no cooking required", # Otherwise, retain original value TRUE ~ as.character(cook_time) ) ) # chunk the dataset chunk_size <- 1 n <- nrow(cleaned_recipes_df) r <- rep(1:ceiling(n/chunk_size),each = chunk_size)[1:n] chunks <- split(cleaned_recipes_df,r) #empty dataframe recipe_sentence_embeddings <- data.frame( recipe = character(), recipe_vec_embeddings = I(list()), recipe_id = character() ) # create a progress bar pb <- txtProgressBar(min = 1, max = length(chunks), style = 3) # embedding data for (i in 1:length(chunks)) { recipe <- as.character(chunks[i]) recipe_id <- paste0("recipe",i) recipe_embeddings <- textEmbed(as.character(recipe), layers = 10:11, aggregation_from_layers_to_tokens = "concatenate", aggregation_from_tokens_to_texts = "mean", keep_token_embeddings = FALSE, batch_size = 1 ) # convert tibble to vector recipe_vec_embeddings <- unlist(recipe_embeddings, use.names = FALSE) recipe_vec_embeddings <- list(recipe_vec_embeddings) # Append the current chunk's data to the dataframe recipe_sentence_embeddings <- recipe_sentence_embeddings %>% add_row( recipe = recipe, recipe_vec_embeddings = recipe_vec_embeddings, recipe_id = recipe_id ) # track embedding progress setTxtProgressBar(pb, i) } # Add documents to the collection add_documents( client, "recipes_collection", documents = recipe_sentence_embeddings$recipe, ids = recipe_sentence_embeddings$recipe_id, embeddings = recipe_sentence_embeddings$recipe_vec_embeddings ) chat.R # Load required packages library(ellmer) library(text) library(rchroma) library(shinychat) ui <- bslib::page_fluid( chat_ui("chat") ) server <- function(input, output, session) { # Connect to a local ChromaDB instance running on docker with embeddings loaded client <- chroma_connect() # sentence embeddings function and query question <- function(sentence){ sentence_embeddings <- textEmbed(sentence, layers = 10:11, aggregation_from_layers_to_tokens = "concatenate", 
aggregation_from_tokens_to_texts = "mean", keep_token_embeddings = FALSE ) # convert tibble to vector sentence_vec_embeddings <- unlist(sentence_embeddings, use.names = FALSE) sentence_vec_embeddings <- list(sentence_vec_embeddings) # Query similar documents results <- query( client, "recipes_collection", query_embeddings = sentence_vec_embeddings , n_results = 2 ) results } # function that provides context tool_context <- tool( question, "obtains the right context for a given question", sentence = type_string() ) # Initialize the chat system chat <- chat_ollama(system_prompt = "You are a knowledgeable culinary assistant specializing in recipe recommendations. You provide tailored meal suggestions based on the user's available ingredients and the desired amount of food or servings. Ensure the recipes align closely with the user's inputs and yield the expected quantity.", model = "llama3.2:3b-instruct-q4_K_M") #register tool chat$register_tool(tool_context) observeEvent(input$chat_user_input, { stream <- chat$stream_async(input$chat_user_input) chat_append("chat", stream) }) } shinyApp(ui, server) You can find the complete code here. Conclusion Building a local Retrieval-Augmented Generation (RAG) application using Ollama and ChromaDB in R programming offers a powerful way to create a specialized conversational assistant. By leveraging the capabilities of large language models and vector databases, you can efficiently manage and retrieve relevant information from extensive datasets. This approach not only enhances the performance of language models but also ensures customization and privacy by running the application locally. Whether you're developing a cooking assistant or any other domain-specific chatbot, this method provides a robust framework for delivering intelligent and contextually aware responses. ...
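Before relying on the Shiny interface, it can help to smoke-test the pieces from the R console. This is a minimal sketch that assumes the objects defined in chat.R (client, question(), and chat) already exist in your session; the query text is only an example:

# optional smoke test, assuming question() and chat from chat.R are in the session
# 1. retrieval only: should return the two closest recipe documents
context <- question("a quick dinner using chicken, rice, and spinach")
str(context, max.level = 2)
# 2. full round trip: the chat object can call the registered tool for context
chat$chat("Suggest a dinner for two using chicken, rice, and spinach.")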

14 April 2025 · hashnode

How to Build a Weather App with R Shiny

In this tutorial, you’ll learn how to build a weather app in R. Really – a weather app, in R? Wait, hear me out. When you think of R, you probably imagine someone wearing chunky thick prescription glasses and devouring a book. You know, a statistician dealing with complex models, an insane amount of mathematical equations, and copious amounts of data. But R is far more than just a tool for statistics. It shines when you need to turn raw data into actionable insights and present those insights in a clear, engaging way. With frameworks like Shiny, R takes this one step further, enabling you to create fully interactive web apps without having to worry about frontends, backends, or learning an entirely new programming language. In this tutorial, you will create a simple weather app that fetches data from an API and displays the results in a good-looking app. Table of Contents Project Overview Project Setup API Keys: Storage and Retrieval How to Make Your First API Call How to Build the Shiny App Conclusion Project Overview Here’s what we’re going to be building: For the weather app to work, you will need to make two separate API calls. We’ll use the One Call API 3.0 to fetch weather data and the OpenWeather API for geocoding. You can get your API Key here. Just keep in mind that if this is your first time signing up for an API key, activation may take up to 24 hours. The weather app will take the location/city from user input. The input will then be geocoded by making the call to the OpenWeather API. Then, from its response, the coordinates (latitude and longitude) will be extracted. The coordinates will be used as query arguments for the One Call API call to obtain the weather data in JSON format. Prerequisites: To follow along with this tutorial, you will need: R programming knowledge HTML and a bit of JavaScript knowledge RStudio installed Project Setup Create a folder in your desired directory. Set and confirm the project folder as the working directory using the following command in the R console: setwd("path/to/your/project/file") getwd() Create a project in the set path using the following command: #create R project usethis::create_project(path = ".", open = FALSE) You should have a folder structure that looks like this. Create an R file in the root directory and save it as app.R. All your R code will be contained here. Install and load the following libraries that you are going to work with: library(shiny) library(bslib) library(shinyjs) library(httr2) library(lubridate) library(shiny.semantic) API Keys: Storage and Retrieval Storing your credentials in a location separate from your scripts and global environment is a good practice. This ensures security, scalability, and flexibility, especially when working in shared or production environments. The .Renviron file best serves that purpose. Open and edit your .Renviron file in the following way: #open and edit .Renviron usethis::edit_r_environ(scope = "project") The scope argument set to "project" sets up the .Renviron specifically for your project. In the newly opened file, add your API key as follows: OPENWEATHERAPIKEY="yourapikey" How to Make Your First API Call You will be using the httr2 library (built on top of httr) to obtain data from the API. It grants you more control over how you make requests to the web. Make the API Key accessible in the script First, you’ll need to securely access and store the API key in the script without hardcoding it.
You can do that like this: #access API keys in script readRenviron(".Renviron") api_key <- Sys.getenv("OPENWEATHERAPIKEY") Define the Geocoding Function You will create a function that takes a location and an API key as inputs, sends a request to the OpenWeather geocoding API, and returns the coordinates of the specified location. Start by creating a request. The pipe (|>) operator facilitates the chaining of HTTP requests step by step in a clear and readable manner. The geocoding URL takes two parameters: the location, denoted by q, and the API key, denoted by appid. The req_url_query() function appends these parameters to the query. Chain req_perform() to send the request, then parse the response as JSON with resp_body_json(), before passing the result to the coordinates() helper defined below. # Geocoding URL geocoding_url <- "https://api.openweathermap.org/data/2.5/weather" geocode <- function(location, api_key) { request(geocoding_url) |> req_url_query(`q` = location, `appid` = api_key) |> req_perform() |> resp_body_json() |> coordinates() } Define the coordinate-extracting function The coordinates() function is a helper function that extracts the latitude and longitude values from the JSON response. A quick inspection of the JSON response reveals the coordinate's position. The JSON object is simply a long list of lists and you can access elements by subsetting it. A blank data body would imply that the city/location is unavailable, and you’d get the message "No such city exists!". If the JSON contains an element, the length would be more than 0 – it is a list after all. coordinates <- function(body) { if(length(body) != 0) { lat <- body$coord$lat lng <- body$coord$lon town <- body$name c(lat, lng, town) } else { "No such city exists!" } } Define the weather-update function You will create a function that sends a request to the OpenWeather API with specified query parameters, handles errors using a predefined function, and returns the parsed JSON response containing the weather data. As implemented in the geocoding function, start by creating a request and adding the necessary query parameters using the req_url_query() function. The openweather_json() function accepts two main arguments: api_key: This is a required argument used for authentication with the OpenWeather API, matched by position. ...: This represents optional keyword arguments that you can use to customize the query. You can pass as many additional parameters as needed, provided they are specified as named arguments. openweather_json <- function(api_key, ...) { request(current_weather_url) |> req_url_query(..., `appid` = api_key, `units` = "metric") |> req_error(body = openweather_error_body) |> req_perform() |> resp_body_json() } Error Handling: Extracting and Managing Status Codes You will create an error-handling function that extracts non-200 status codes from a response and defines how to manage them. The structure of this function depends on how the API reports errors and where the relevant information is stored. Define the weather-update error body The req_error() in openweather_json() introduces a new concept: error handling. API requests may throw exceptions, and getting the status codes helps you know what message to show the user and how to resolve it. Create an error body, which is a function that captures the error message if the status code is not 200 (200 means everything is OK). The function takes a response and extracts the error message stored in the JSON response at the $message sublist.
The underscore (_)is a placeholder for the JSON object. openweather_error_body <- function(resp) { resp |> resp_body_json() |> _$message } Define the geocode error body This error body function will prove useful in the Shiny App. This is a simple walkthrough. The req_error() function allows you to customize how response errors are handled. Its is_error argument determines whether a given response should be considered an error. By setting is_error to \(resp) FALSE (an anonymous function that always returns FALSE), all responses, regardless of the status code, are treated as successful. This prevents the app from exiting due to non-200 status codes. With this setup, you can extract the status code from the response body and pipe it into the resp_status() function to retrieve the exact code. openstreetmap_error_body <- function(location, api_key) { resp <- request(geocoding_url) |> req_url_query(`q` = location, `appid` = api_key) |> req_error(is_error = \(resp) FALSE) |> req_perform() |> resp_status() resp } How to Build the Shiny App Now that you have nailed down how to obtain data from the API, it’s time to render the results in an interpretable and interactive format. For this, you will use Shiny. Shiny is a framework that allows you to create interactive web apps. A Shiny App is made up of two components: The UI: what the user interacts with. It defines the layout and appearance of the app. The server: contains the app’s logic and behaviour. Building the Shiny UI Shiny UI provides a collection of elements that allow users to input data, make selections, and trigger events seamlessly. You will include a textInput element that takes in the location and the weather data will be fetched and rendered upon submission. The input_task_button button prevents the user from clicking when an API call is in progress. The other elements are output elements where the weather data will be displayed and a mode-switching button. Styling the Shiny app You can use shiny.semantic, a library built on top of Fomantic-UI, to style your Shiny dashboard. Fomantic-UI is a front-end framework that provides a rich collection of pre-styled HTML components like buttons, modals, form inputs, and more. It simplifies UI design by allowing developers to create visually appealing and responsive interfaces without needing extensive custom CSS or HTML knowledge. Fomantic-UI styling is applied by wrapping elements in their corresponding classes, which define their behavior and appearance. A grid in Fomantic-UI is a flexible layout system used to organize content. It acts as a canvas that divides the layout into rows (horizontally aligned) and columns (vertically aligned). A root grid can contain up to 16 columns, making it ideal for creating structured and responsive designs. To specify a column's width, you append classes like wide and the size (a number from 1 to 16) to represent its span. The total width of all columns in a row should sum up to 16. A segment groups related content, while a card displays detailed, content-rich items, such as a user's social media profile. Dividers are visual elements used to separate sections or content within a layout. For the weather app, first create a div of class grid within which you’ll nest the various elements. Search bar section Divide the grid into sixteen columns and create a segment that groups elements in the search bar section. 
Add a theme toggle button, location input that takes in user input, a search button for submitting the location to the API, and a notification button, defining their width by the column size. div(class = "sixteen wide column", div(class = "ui segment", div(class = "ui grid", div(class = "two wide column", button( class = "ui button icon basic", input_id = "darkmode", label = NULL, icon = icon("moon icon") ) ), div(class = "ten wide column", textInput( "location", label = NULL, placeholder = "Search for your preferred city" ) ), div(class = "two wide column", tags$div( class = "ui button", id = "my-custom-button", input_task_button("search", label = "Search", icon = icon("search")) ) ), div(class = "two wide column", actionButton("show_alert", label = icon("bell"), class = "bell-no-alert"), textOutput("alert_message") ) ) ) ) Location and current weather section Divide the grid into sixteen columns and nest another grid within the partitions that will host two columns. Within the grid, define two columns. The first column is for time, location, and date data, and the second column will hold current weather data. Then create card elements to hold each weather parameter, its unit of measurement, and the corresponding icon. div(class = "sixteen wide column", div(class = "ui equal-height-grid grid", div(class = "left floated center aligned four wide column", div(class = "ui raised equal-height-two-segment segment", style = "flex: 1;", div(class = "column center aligned", div(class = "ui hidden section divider"), span(class = "ui large text", textOutput("city")), div(class = "ui hidden section divider"), span(class = "ui big text", textOutput("currentTime")), div(class = "ui hidden section divider"), span(class = "ui large text", textOutput("currentDate")), div(class = "ui hidden section divider") ) ) ), div(class = "right floated center aligned twelve wide column", div(class = "ui raised segment", div(class = "ui horizontal equal width segments", div(class = "ui equal-height-two-segment segment", style = "flex: 3;", div(class = "column", span(class = "ui big text centered", textOutput("currentTemp")), textOutput("feelsLike"), card( class = "ui mini", div(class = "content", icon(class = "large sun"), div(class = "sub header", "Sunrise"), div(class = "description", textOutput("sunriseTime")) ) ), card( class = "ui mini", div(class = "content", icon(class = "large moon"), div(class = "sub header", "Sunset"), div(class = "description", textOutput("sunsetTime")) ) ) ) ), div(class = "ui segment", style = "flex: 3;", div( class = "column center aligned", div(class = "ui hidden divider"), htmlOutput("currentWeatherIcon"), span(class = "ui large text", textOutput("currentWeatherDescription")) ) ), div(class = "ui segment", style = "flex: 3;", div(class = "column", card( class = "ui tiny", div(class = "content", icon(class = "big tint"), div(class = "sub header", "Humidity"), div(class = "description", textOutput("currentHumidity")) ) ), card( class = "ui tiny", div(class = "content", icon(class = "big tachometer alternate"), div(class = "sub header", "Pressure"), div(class = "description", textOutput("currentPressure")) ) ) ) ), div(class = "ui segment", style = "flex: 3;", div(class = "column center aligned", card( class = "ui tiny", div(class = "content", icon(class = "big wind"), div(class = "sub header", "Wind Speed"), div(class = "description", textOutput("currentWindSpeed")) ) ), card( class = "ui tiny", div(class = "content", icon(class = "big umbrella"), div(class = "sub header", "UV Index"), 
div(class = "description", textOutput("currentUV")) ) ) ) ) ) ) ) ) ) Forecast section This section holds the forecasted data. Divide the grid into sixteen columns and nest another grid within the partitions hosting two columns. Within the grid, define two columns. The first column holds the 5-Day Forecast data. Separate the elements containing different values using rows. The second column contains Hourly Forecast data. Separate the elements containing different values using columns. # Forecast section div(class = "sixteen wide column", div(class = "ui grid equal-height-grid", div(class = "left floated center aligned six wide column", div(class = "ui raised segment special-segment equal-height-segment", h4("5 Days Forecast:"), div(class = "ui three column special-column grid", # Day forecasts div(class = "row", div(class = "five wide column", textOutput("dailyDtOne")), div(class = "three wide column", textOutput("dailyTempOne")), div(class = "three wide column", htmlOutput("dailyIconOne")) ), div(class = "row", div(class = "five wide column", textOutput("dailyDtTwo")), div(class = "three wide column", textOutput("dailyTempTwo")), div(class = "three wide column", htmlOutput("dailyIconTwo")) ), div(class = "row", div(class = "five wide column", textOutput("dailyDtThree")), div(class = "three wide column", textOutput("dailyTempThree")), div(class = "three wide column", htmlOutput("dailyIconThree")) ), div(class = "row", div(class = "five wide column", textOutput("dailyDtFour")), div(class = "three wide column", textOutput("dailyTempFour")), div(class = "three wide column", htmlOutput("dailyIconFour")) ), div(class = "row", div(class = "five wide column", textOutput("dailyDtFive")), div(class = "three wide column", textOutput("dailyTempFive")), div(class = "three wide column", htmlOutput("dailyIconFive")) ) ) ) ), div(class = "right floated center aligned ten wide column", div(class = "ui raised segment special-segment equal-height-segment", h4("Hourly Forecast:"), div( class = "ui grid", style = "display: flex; flex-direction: row; align-items: center; justify-content: space-around; flex-wrap: wrap; height: 100%;", # Hourly forecasts div(class = "column", textOutput("hourlyDtOne"), htmlOutput("hourlyIconOne"), textOutput("hourlyTempOne") ), div(class = "column", textOutput("hourlyDtTwo"), htmlOutput("hourlyIconTwo"), textOutput("hourlyTempTwo") ), div(class = "column", textOutput("hourlyDtThree"), htmlOutput("hourlyIconThree"), textOutput("hourlyTempThree") ), div(class = "column", textOutput("hourlyDtFour"), htmlOutput("hourlyIconFour"), textOutput("hourlyTempFour") ), div(class = "column", textOutput("hourlyDtFive"), htmlOutput("hourlyIconFive"), textOutput("hourlyTempFive") ) ) ) ) ) ) ) Building the Shiny Server Each element in the UI section has an ID (unique identifier) that is used to manipulate what data/information will be displayed to it. The render*() set of functions defines the visualization type while the output$* functions subset elements. These two are used to link the visual to the logic. Most elements will have data extracted from the JSON list, except for the weather icons (for which an external link as a source will be referenced). Reactivity Reactivity is what makes Shiny apps dynamic—outputs automatically update when their dependencies change. Two key components of reactivity are reactives and observers. 
A reactive computes and returns a value based on its dependencies, while an observer monitors reactive values and runs code that causes side effects, like logging or updating a database. To control reactivity, you can use bindEvent() to delay execution until a specific event occurs or observeEvent() to listen for a user action and trigger a code block. Together, these tools provide flexibility for managing app behavior. The Server Code location reactive The location reactive includes an if-else conditional block that defines what message to display depending on the status code. The query variable contains the city/location that will be geocoded to obtain coordinates. The flow is piped to bindEvent(). This ensures the geocoding API call is completed before another call can be made, which reduces unnecessary requests. location <- reactive({ query <- input$location if(openstreetmap_error_body(query, api_key) == "404"){ validate("No such city/town exists. Check your spelling!") } else if(openstreetmap_error_body(query, api_key) == "400"){ validate("Bad request") } coords <- geocode(query, api_key) }) %>% bindEvent(input$search) weather_data reactive The weather reactive combines a geocoding API call and a weather update API call using coordinates obtained and extracted from location(): weather_data <- reactive({ loc <- location() openweather_json(api_key, lat = loc[1], lon = loc[2]) }) To access the JSON objects returned by the API call, you call the reactive as if it were a function. The specific values to be extracted can then be accessed by subsetting the JSON value. # subsetting weather data. output$city <- renderText({ location()[3] }) output$currentWeatherDescription <- renderText({ weather_data()$current$weather[[1]]$description }) Create a Parse Date function All the time data in the JSON response, forecasted or current, is provided in UNIX format. To make this information user-friendly, it needs to be converted into a human-readable format. You can do this by creating a function that takes the time data as input and uses functions from the lubridate package to handle the conversion. First, convert the timestamp element to a datetime object. Format the time item to a 12-hour clock system and a date item to include the day of the week, the date, and the month. %I: Displays the hour in a 12-hour clock format (01-12). %M: Displays the minutes (00-59). %p: Adds the AM/PM indicator. The paste function concatenates the values. The function returns a vector containing date and time values to be extracted by subsetting. parse_date <- function(timestamp) { datetime <- as_datetime(timestamp) date <- paste(weekdays(datetime), ",", day(datetime), months(datetime)) time <- format(as.POSIXct(datetime), format = "%I:%M %p") c(date, time) } Add a modal to display error messages The location reactive provides a way to handle errors. You can incorporate a modal to enhance the user experience by overlaying the page and disabling its content until the user completes a specified action whenever an error occurs. You’ll add JavaScript to control when and how the modal shows. Add two modals in the UI section, each featuring an explanation of the error (header) and an outline of the required action (content). The action class includes a button that enables the user to close the modal. # modals - UI div(id = "notFound", class = "ui modal", div(class = "header", "Location Not Found"), div(class = "content", "No such city/town exists. 
Check your spelling!"), div(class = "actions", div(class = "ui button", id = "closeNotFound", "OK")) ), div(id = "badRequest", class = "ui modal", div(class = "header", "Invalid Request"), div(class = "content", "Bad request. Please try again with valid details."), div(class = "actions", div(class = "ui button", id = "closeBadRequest", "OK")) ) Slightly adjust the location reactive to incorporate the modal. The commented-out code will be replaced with the JavaScript lines. The runjs function shows the modal depending on the error encountered. req(FALSE) terminates the reactive flow. # show and hide modals - Server location <- reactive({ query <- input$location if(openstreetmap_error_body(query, api_key) == "404"){ #validate("No such city/town exists. Check your spelling!") runjs("$('#notFound').modal('show');") req(FALSE) } else if(openstreetmap_error_body(query, api_key) == "400"){ #validate("Bad request") runjs("$('#badRequest').modal('show');") req(FALSE) } coords <- geocode(query, api_key) }) %>% bindEvent(input$search) # listens for button click on modals to hide modal observeEvent(input$closeNotFound, { runjs("$('#notFound').modal('hide');") }) observeEvent(input$closeBadRequest, { runjs("$('#badRequest').modal('hide');") }) Conclusion In this tutorial, you have built a weather app using Shiny that retrieves weather data from an API and displays it in an interactive and visually appealing way. To do this, you used the following libraries: httr2 for making API requests and handling responses shiny.semantic for styling the app lubridate for working with and formatting time data shinyjs for integrating JavaScript features into the app This combination of tools allowed you to create a functional, user-friendly weather app. You can find the complete code for the project here. La Fin! ...
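If you want to verify the API plumbing outside the app, a quick console check also works. This is a minimal sketch that assumes the functions and URLs defined above (geocode(), openweather_json(), current_weather_url) are loaded and that OPENWEATHERAPIKEY is set in your project's .Renviron; "Nairobi" is only an example city:

# quick console check before launching the Shiny app
readRenviron(".Renviron")
api_key <- Sys.getenv("OPENWEATHERAPIKEY")
coords <- geocode("Nairobi", api_key)   # returns c(lat, lon, town)
coords
weather <- openweather_json(api_key, lat = coords[1], lon = coords[2])
weather$current$temp                    # current temperature (metric units)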

09 December 2024 · hashnode

How to Run R Programs Directly in Jupyter Notebook Locally

R is a popular programming language that’s now widely used in research-related fields like Bioinformatics. And to use R, you’ll need to install the R Compiler and R Studio. But did you know that you can also directly run your R code right in a Jupyter Notebook? This helps in so many ways if you are already used to using Jupyter Notebook for Machine Learning-related tasks using Python. In this tutorial, I’ll show you exactly how you can set up your local machine to run the R programming language directly in Jupyter Notebook. The processes I am going to show you today are equally applicable to all major operating systems (Windows, MacOS, and Linux OSes). Table of Contents Install Conda Create a New Environment Activate Your Conda Environment Install ipykernel and jupyter Install R in the Conda Environment Open the Jupyter Notebook Run R in Jupyter Notebook Conclusion Install Conda You’d normally use Conda to handle multiple environments in Python. And here, we’re going to use the same Conda program to install R in our environment. You can either use Anaconda or Miniconda. I prefer Miniconda as it’s so lightweight. You’ll also get the opportunity to install the latest packages directly using Miniconda. But you can simply go with the Anaconda if you are already comfortable with that. Create a New Environment Many people tend to use the Base environment. But I never like to use the Base environment directly as you typically need multiple environments for handling different package and versions of packages as well. So I’ll create a new environment where I’ll work on my R programming language-related tasks using Jupyter Notebook. To create a new Conda environment, simply use the following command: conda create --name r-conda Here, r-conda is my Conda environment’s name. You can choose any other name, but keep in mind that the conda env name can not have any whitespaces in it. It will create a new Conda environment named r-conda for me. Activate Your Conda Environment If you want to work on a separate conda environment, you’ll need to make sure that you’re activating that specific conda environment before starting to do anything. I want to work on the r-conda conda environment. So I can simply activate the conda environment using the following command: conda activate r-conda You need to use the exact conda env name that you want if it’s different than r-conda in the command. 💡 Keep in mind that you need to activate the conda environment successfully before proceeding further. You will see the conda environment’s name as (conda-env-name) at the left side of your terminal. Install ipykernel and jupyter I always like to install the ipykernel and jupyter in all of my conda environments as they help manage different conda environments’ Jupyter notebooks/labs separately. So I’m going to install them together in my conda env by using the command below: conda install ipykernel jupyter This will install both ipykernel and jupyter in the activated conda environment. Install R in the Conda Environment To install R directly in the conda environment, simply use the following command: conda install -c r r-irkernel This will install the necessary components that enable your local computer to run the R program in your Jupyter Notebook. Open the Jupyter Notebook Now you can open the Jupyter Notebook either by using jupyter notebook or jupyter notebook --ip=0.0.0.0 --port=8889 --no-browser --allow-root --NotebookApp.token=''. 
Open the Jupyter Notebook Now you can open Jupyter Notebook either by running jupyter notebook or jupyter notebook --ip=0.0.0.0 --port=8889 --no-browser --allow-root --NotebookApp.token=''. Just make sure to modify the IP, port, root configuration, and token as you see fit for your work. Open the link printed in the terminal to open Jupyter Notebook in your web browser. Run R in Jupyter Notebook After opening Jupyter Notebook in your web browser, when you create a new notebook, you will see R directly in the “New” menu, as in the image below. Now you can use the R language directly in your Jupyter Notebook! You can also see the R programming language logo at the top right of your notebook. Conclusion Thank you for reading the entire article. I hope you have learned something new here. If you enjoyed following the steps, don't forget to let me know on Twitter/X or LinkedIn. I would appreciate it if you could endorse me for some relevant skill sets on LinkedIn. I would also recommend subscribing to my YouTube channel for regular programming-related content. You can follow me on GitHub as well if you are interested in open source. Make sure to check out my website too. Thank you so much! 😀 ...

03 October 2024hashnode

The Role of Data Analysis in Enhancing Engineering Productivity

The Role of Data Analysis in Enhancing Engineering Productivity

Introduction - Moving from Guesswork to Greatness with Data Have you ever wondered why some engineering teams are on a rocket ship to success while others are stuck in the mud? Spoiler alert: it’s not about working harder but working smarter. The game changer? Data analysis. A powerful data analysis tool can transform your engineering productivity from "meh" to "wow" faster than you can say "big data." According to a study by McKinsey, companies that leverage data analysis see a productivity boost of up to 25%. That’s right: just by harnessing the power of data, you can squeeze out an extra quarter of awesomeness from your engineering team. Imagine telling your boss, "Yeah, we increased our output by 25%." You’d be the office superhero, and your colleagues might think you have some Iron Man-level superpowers. Why Is Data Crucial for Boosting Engineering Productivity? Data analysis in engineering is like having a savvy friend who knows all the best shortcuts—saving you time, effort, and possibly a few headaches. It’s all about collecting, processing, and decoding the massive amounts of data generated during your engineering adventures. Picture it as your GPS in a city packed with dead ends. Without it, you’re basically wandering around, hoping to stumble upon success by sheer luck. So why is data such a game-changer for engineering productivity? Let’s break it down: Informed Decision-Making: Making Choices Like a Pro, No Guesswork Please 😥 Imagine you’re trying to decide whether to upgrade your coffee maker or just stick with the old one. Relying on gut feeling might leave you jittery with regret. But if you look at data—like how often your current coffee maker breaks down or how much better the new one brews—you’re making a decision based on facts, not feelings. That’s exactly how data works for engineers. Instead of flying blind, they use data to chart their course. They dive into past performance metrics, keep an eye on current progress, and use predictive analytics to forecast what’s coming next. It’s like having a detailed map and a GPS for your engineering projects—making sure you’re not just guessing your way to success but navigating with precision. Also read: Leveraging Data-Driven Decision Making in Engineering Management Identifying Bottlenecks: Because Even Your Workflow Deserves a Smooth Ride! Ever feel like your workflow is a clunky old car stuck in a traffic jam? It’s time to upgrade to a smooth ride with data analysis. By pinpointing where those pesky slowdowns are happening, you can streamline your processes and get things moving at full speed. What’s a Bottleneck, Anyway? Think of a bottleneck as the exit ramp where traffic slows to a crawl. In engineering, it’s the part of your workflow that causes delays and inefficiencies. Whether it’s a sluggish code review or a lagging deployment process, identifying and addressing these choke points is crucial for keeping everything running smoothly. How Data Analysis Comes to the Rescue Spotting the Slowpokes: Data analysis helps you figure out exactly where things are getting stuck. Imagine you’re running a relay race, and one team member is consistently lagging behind. Data can show you which stage of your workflow is causing delays—like whether it’s the code commits or the testing phase. Fixing the Jams: Once you know where the trouble is, you can tackle it head-on.
If your data reveals that reviews are taking forever because reviewers are overloaded, you can reassign tasks or streamline the process to get things moving faster. Streamlining Your Route: With the right insights, you can make adjustments that keep everything flowing smoothly. For instance, if bugs are slowing you down, improving your testing procedures or adding automated tests can help you avoid those annoying slowdowns. Real-Life Application Picture this: a software team juggling multiple projects. Data analysis might uncover certain stages of their workflow that consistently slow things down. By optimizing these stages, like speeding up code reviews or automating routine tasks, the team can get updates out faster. So, next time you find your workflow bogged down, remember: with data analysis, you’ve got the perfect tool to clear the traffic and get everything cruising at top speed! Also read: How to Leverage Sprint Retrospectives to Drive Software Team’s Growth: A Data-Driven Guide for Technical Managers Predictive Analysis: Your Crystal Ball for Avoiding Disasters (No Magic Required)! Imagine if you had a superpower that let you see trouble coming from a mile away. Sounds like a dream, right? Well, with data analysis, you can practically have that power! By digging into past data, you can spot potential problems before they explode into full-blown fiascos. It’s like having a crystal ball that’s all about actionable insights instead of cryptic prophecies. Why Predicting Problems is Better Than Playing Catch-Up Think of predictive analytics as your personal fortune teller. It scans your data like a detective on a mystery case, uncovering patterns that hint at future issues. For instance, if your data shows that certain types of bugs are recurring frequently, you can tackle these issues proactively rather than scrambling to fix them once they’ve already caused chaos. The Benefits of Predictive Analysis in Real Life Project Management Magic: According to a Forrester survey, companies that use predictive analytics cut project overruns by 20%. That’s like getting a backstage pass to a smoother project experience. Instead of dealing with last-minute scrambles, you stay on track and keep everything under budget. Avoiding Headaches: Imagine you’re throwing a big event and predicting rain because you noticed storm patterns in past data. You’d prepare with tents and umbrellas ahead of time, avoiding soggy guests and ruined plans. Similarly, in engineering, predicting potential issues helps you prepare and dodge those major headaches. Staying Ahead of the Game: Predictive analytics is like having a sneak peek into your project's future, helping you spot problems before they even think about sneaking up on you. For instance, imagine your engineering team notices that code reviews are frequently causing delays because of complex merges. With predictive analytics, you can identify this pattern early on and address it—maybe by simplifying the merge process or adding automated checks—before it snowballs into a major bottleneck. This proactive approach keeps your development process smooth and avoids those dreaded last-minute scrambles.
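To make "digging into past data" a little more concrete, here is a minimal R sketch of the idea; the weekly bug counts are invented for the example, and a plain linear trend stands in for a real forecasting model:

# Hypothetical weekly bug counts pulled from an issue tracker.
bugs <- data.frame(week = 1:12,
                   count = c(5, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14, 17))
fit <- lm(count ~ week, data = bugs)          # fit a simple upward trend
predict(fit, newdata = data.frame(week = 13)) # rough estimate of next week's bug load

A rising trend like this is exactly the kind of early signal that justifies investing in better testing before the backlog snowballs.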
Improving Collaboration and Communication: No More “he said, she said”! Tired of important details getting lost in translation? Miscommunication can be a real game-changer—just not in the way you want. Data analysis swoops in to save the day by giving your team a single source of truth. With everyone on the same page, you can kiss those “he said, she said” moments goodbye. It’s all about clear, data-driven communication. Why Clear Communication Matters Imagine your software engineering team is working on a major project. Without a unified data system, one developer might be working on a feature based on outdated specs, while another might be fixing bugs that had already been fixed. Chaos, right? Data-driven tools ensure that everyone has access to the same up-to-date information. This means when the lead developer says, “We need to push this feature live,” everyone knows exactly what that means—no more confusion about whether it’s a lunch break or a product launch. So, with data analysis as your secret weapon, you can turn potential communication mix-ups into streamlined success and ensure that your projects launch smoothly—no accidental lunches involved! Enhancing Decision-Making: Because Gut Feelings and Horoscopes Aren’t the Best Project Managers! 😌 Remember when decisions were made based on gut feelings and the alignment of the stars? Ah, the good old days. But now, we’ve moved beyond star charts and hunches. With data analysis on your side, decisions are grounded in solid evidence rather than just intuition. This means more reliable outcomes and a higher success rate for your engineering projects. Plus, it's a lot easier to defend your choices with data than with vague horoscopes. Why Data Beats Gut Feelings Imagine your software engineering team is debating whether to prioritize a new feature or focus on bug fixes. Back in the day, you might have gone with your gut—or consulted your daily horoscope 😛. But with data analysis, you can look at metrics like user feedback, bug reports, and feature adoption rates to make a well-informed choice. Instead of saying, “I have a feeling this feature will be a hit,” you can present evidence-backed reasons, such as, “Our data shows a 40% increase in user interest for this feature, so let’s prioritize it.” Let’s say your team is considering whether to invest time in optimizing an existing feature or developing a new one. With access to data, you can analyze performance metrics, user engagement, and historical data to guide your decision. For instance, if data shows that the existing feature is causing frequent issues and impacting user satisfaction, it makes sense to focus on optimization. According to the International Institute for Analytics, companies that rely on data-driven decision-making are three times more likely to see substantial improvements in their outcomes. This means when you walk into that meeting with data in hand, you can confidently ditch the crystal ball and make decisions backed by solid evidence. So, leave the horoscopes and gut feelings for your personal life—when it comes to engineering decisions, data is your new best friend. The Fun Part: Tools and Technologies – Because Who Doesn’t Want to Play with Data Gadgets? All right, so you’re all in on data analysis. But where do you start in this vast universe of tools? Think of it as being a kid in a candy store, except instead of sugary treats, you’re surrounded by productivity-boosting data insights. From heavyweights like Python and R to nifty tools like Tableau and Power BI, there’s a gadget for every need and budget. Must-Have Data Analysis Toolkit for Engineering Teams Python & R These are the Swiss Army knives of data analysis. Python’s great for everything from crunching numbers to building complex algorithms, while R shines in statistical analysis and visualizations. Using these languages can make you feel like a data wizard, casting spells of insight with every line of code.
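For a taste of how little R this takes, here is a tiny sketch; the review durations are made-up numbers standing in for data you would export from your own tooling:

# Invented PR review durations, in hours.
review_hours <- c(2, 5, 3, 8, 1, 12, 4, 6)
summary(review_hours)                                        # quick statistical overview
hist(review_hours, main = "PR review time", xlab = "Hours")  # one-line visualization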
Tableau & Power BI For those who love visualizing data like an artist, these tools are your paintbrushes. They turn raw data into stunning dashboards and interactive charts. According to a survey by Gartner, teams using Tableau for data visualization reported a 28% increase in productivity. Power BI isn’t far behind, helping teams dive into their data with ease. Middleware Middleware’s DORA metrics dashboard is your ultimate toolkit for optimizing software delivery. With features like detailed PR insights and even automated JIRA sprint reports, you can keep your engineering processes smooth and efficient. Imagine having a GPS for your development pipeline, showing you the fastest route to top performance. Teams leveraging DORA metrics see significant boosts in productivity and delivery speed, making it an invaluable addition to your data arsenal. Picture your software engineering team gearing up for a big release. With Python, you can script complex analyses to identify potential bottlenecks. Tableau helps you visualize team performance and track progress with eye-catching dashboards. And with Middleware’s DORA metrics, you can get actionable insights into your development pipeline, ensuring everything runs like a well-oiled machine. So, dive into this candy store of data tools and discover how each one can sweeten your workflow. Whether you’re coding in Python, designing in Tableau, or optimizing with DORA metrics, you’ll find plenty of ways to boost productivity and make data your new best friend. How to Effectively Collect and Analyze Engineering Data? Step 1: Set Clear Objectives Before diving into the ocean of data, you need a solid plan. Ask yourself: What problem are you trying to solve? What questions do you need to answer? What metrics will help you measure success? With clear objectives, you'll know exactly what data to collect and why. Step 2: Identify Relevant Data Sources Your data sources will depend on your objectives. Common sources include: Project Management Tools: Jira, Trello, Asana Version Control Systems: GitHub, GitLab, Bitbucket Continuous Integration/Continuous Deployment (CI/CD) Tools: Jenkins, CircleCI, Travis CI Communication Tools: Slack, Microsoft Teams These tools capture a lot of valuable information about your engineering processes. Step 3: Automate Data Collection Manually collecting data is a time sink. Use automated tools and scripts to gather data from your sources. Tools like Datadog, New Relic, or even custom scripts can pull data from APIs and store it in a centralized location. Step 4: Clean and Streamline Your Data Raw data can be messy and inconsistent. Clean and streamline your data to ensure consistency. This involves: Removing duplicates Handling missing values Standardizing formats (e.g., dates, units) Clean data ensures accurate analysis.
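As a rough illustration of this cleanup step in base R (the tasks data frame and its columns are invented for the example):

# A messy, hypothetical export of task records.
tasks <- data.frame(
  id      = c(1, 2, 2, 3, 4),
  started = c("2024-06-01", "2024-06-02", "2024-06-02", "2024/06/03", "2024-06-04"),
  hours   = c(5, 3, 3, NA, 8)
)
tasks <- tasks[!duplicated(tasks), ]                                  # remove the duplicated row
tasks$hours[is.na(tasks$hours)] <- median(tasks$hours, na.rm = TRUE)  # fill in missing effort
tasks$started <- as.Date(gsub("/", "-", tasks$started))               # standardize date formats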
Step 5: Choose the Right Tools for Analysis There are many tools available for analyzing engineering data. Popular ones include: Data Visualization Tools: Tableau, Power BI, Grafana Statistical Analysis Tools: R, Python (with libraries like Pandas and NumPy) Big Data Tools: Hadoop, Spark Select tools that fit your team's skills and your analysis needs. Step 6: Analyze and Interpret Data With clean data and the right tools, you can start analyzing. Look for patterns, trends, and insights that align with your objectives. Some key metrics to consider include: Cycle Time: The time it takes to complete a task from start to finish Lead Time: The time from when a feature is requested to when it's delivered Deployment Frequency: How often you deploy code to production Mean Time to Recovery (MTTR): The average time to recover from a failure Interpret the results in the context of your objectives. What are the key takeaways? What actions can you take to improve?
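To keep these metrics from staying abstract, here is a minimal R sketch of how they might be computed; all of the data frames and dates below are invented for the example:

# Hypothetical exports from a task tracker, a deployment log, and an incident log.
tasks <- data.frame(
  started  = as.Date(c("2024-06-03", "2024-06-04", "2024-06-05")),
  finished = as.Date(c("2024-06-05", "2024-06-10", "2024-06-06"))
)
deploys <- as.Date(c("2024-06-05", "2024-06-07", "2024-06-07", "2024-06-12"))
incidents <- data.frame(
  down     = as.POSIXct(c("2024-06-06 10:00", "2024-06-11 14:30"), tz = "UTC"),
  restored = as.POSIXct(c("2024-06-06 11:15", "2024-06-11 15:00"), tz = "UTC")
)
mean(as.numeric(tasks$finished - tasks$started))                    # average cycle time, in days
length(deploys) / as.numeric(max(deploys) - min(deploys))           # deployment frequency, per day
mean(difftime(incidents$restored, incidents$down, units = "mins"))  # MTTR, in minutes

Lead time would follow the same pattern, using the request date instead of the start date.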
Step 7: Visualize and Communicate Findings Data without context is just noise. Use visualizations to tell a story with your data. Dashboards, charts, and graphs can make complex data more understandable. Tools like Power BI or Grafana can help you create interactive dashboards. Step 8: Implement Data-Driven Decisions Use the insights from your analysis to drive decisions. Whether it’s optimizing your development process, improving team performance, or enhancing product quality, data-driven decisions are more likely to yield positive results. Step 9: Continuously Monitor and Iterate Data collection and analysis is an ongoing process. Continuously monitor your metrics and iterate on your processes. Regularly review your objectives and adjust your data collection and analysis methods as needed. Also read: How to Build and Lead High-Performing Engineering Teams? Wrapping It Up: Data Analysis – The Key to Engineering Efficiency In a world where every decision counts and efficiency is key, data analysis stands out as a game-changer for engineering productivity. It’s not just about jumping on the latest trend; it’s about transforming how you approach problems, make decisions, and streamline processes. From predicting potential issues before they snowball into chaos to enhancing collaboration with clear, data-driven communication, the benefits are undeniable. But to truly harness the power of data, you need the right data analysis tools. That’s where Middleware’s DORA metrics come in. With its advanced features for optimizing software delivery, detailed PR insights, and team performance metrics, Middleware provides the ultimate solution for maximizing productivity and ensuring your projects stay on track. So, if you’re ready to revolutionize your engineering processes and drive unparalleled efficiency, dive into the world of data analysis with Middleware’s DORA metrics. Let’s turn those data dreams into actionable success and keep your projects cruising smoothly from start to finish. FAQs What types of data are important for engineering productivity? Key data types include project timelines, task completion rates, code quality metrics, bug reports, team performance metrics, and resource utilization. Analyzing these data points helps in understanding how efficiently projects are progressing and where improvements can be made. What are some common data analysis tools in engineering? Common data analysis tools in engineering include Jira for project management, Git and GitHub for version control, Tableau and Power BI for visualization, Python and R for statistical analysis, and Excel for basic data manipulation. Middleware's DORA metrics also offer valuable insights into software delivery performance and team efficiency. What are some challenges in using data analysis for engineering productivity? Challenges include ensuring data accuracy, integrating data from multiple sources, and avoiding data overload. It's also important to have the right skills and tools to interpret data effectively. Addressing these challenges requires a strategic approach and sometimes specialized training. Why is data analysis important in engineering? Engineering data analysis boosts performance by optimizing processes, guiding decisions with objective insights, and predicting issues before they arise, all while improving resource allocation and quality. ...

15 August 2024hashnode

Tkinter project: Simple Interest calculator

Tkinter project: Simple Interest calculator

Introduction This project is a Simple Interest Calculator built using Python’s Tkinter...

09 May 2025devto