Tuesday, February 27, 2024

Feb 27, 2024. A Day of Compassion and Unexpected Encounters

Today was not an ordinary day for me; it began with the need to take an unexpected sick day. The morning unfolded with urgency as I planned to drop off my child, but fate had a different story in store. Around 9:30 AM, near Hopefarm in Whitefield, my day took a dramatic turn. As I attempted to overtake a car on the right, an elderly man crossing the road appeared in my path. Despite his slow pace and my high speed, I managed to brake just in time, reducing the impact of the collision. The man, although not severely injured, fell and sustained minor scratches on his knees.


The incident drew immediate attention, and several passersby stopped to assist. My initial fear of confrontation was alleviated by my attempts to communicate in broken Kannada, which, to my relief, helped manage the situation. With my hands trembling, I turned to my wife and daughter, who were with me in the car, for a water bottle to tend to the injured man. After cleaning his face, we hurried him to Manipal Hospital for medical attention.


Remarkably, the man expressed his gratitude throughout the ordeal. Upon arrival at the hospital, initial examinations showed his heart rate was normal, but what shocked me was learning about his age, 75, and that he had two pacemakers. His resilience was notable, yet the concern for potential injuries due to his age was palpable. Fortunately, after further assessments and an X-ray, we were relieved to hear that there were no fractures, only a knee injury.


Throughout this time, the man remained conscious and trusting. His ability to call his son-in-law and wife brought a sense of comfort to him, and I did my best to provide reassurance. As we awaited the X-ray results, my brother joined me, offering his support. The news that there were no fractures brought a collective sigh of relief.


The atmosphere grew tense with the arrival of his wife and son-in-law, bracing themselves for the worst, fearing the accident had left him in a dire state. However, their anxiety was soon alleviated when they realized that, thanks to the prompt and compassionate actions taken, he was in stable condition. The moment served as a poignant reminder of the fragile nature of life and the importance of kindness in times of crisis.



As conversations unfolded, the wife began to share more about their family's journey, particularly focusing on their son's challenges. Their son, once a manager at IBM, had taken a bold leap by moving to Singapore for work before returning to Bangalore. His career transition from the corporate world to starting his own real estate business marked a period of prosperity for the family. However, life's unpredictable nature struck hard when he suffered a stroke, an event that turned their lives upside down.


At just 45 years old, their son faced a daunting battle for recovery. The medical expenses were astronomical, draining nearly 45 to 50 lakhs from their savings, with hospital bills averaging around 90,000 per day over two months. Despite the financial strain, the progress he made was heartening; from being bedridden to sitting up, each small victory was celebrated. Yet, the emotional and financial toll of the situation was evident as his mother recounted their story, her eyes brimming with tears.


The narrative of their son's struggle and resilience added a layer of depth to the day's events, transforming a simple act of assistance into a profound connection between two families. The gratitude expressed by the elderly man's wife, punctuated by her blessing, "God Bless you and your family," resonated deeply, leaving an indelible mark on the heart.


In the end, as I ensured their safe return home and bid them farewell, the promise to keep their loved one safe from harm was not just a commitment to them but a reminder of the interconnectedness of our lives. The story of their son, a testament to human resilience in the face of adversity, underscored the day's experiences, enriching the fabric of this unexpected encounter with strands of hope, perseverance, and shared humanity.

Thursday, February 22, 2024

Jump Game

Jump Game Peak and Valley approach.


Jump Game is a medium-level LeetCode problem that is very interesting yet brainy. Once you understand the problem correctly, the answer becomes obvious. The problem, according to LeetCode, goes like this:

You are given an integer array nums. You are initially positioned at the array's first index, and each element in the array represents your maximum jump length at that position.

Return true if you can reach the last index, or false otherwise.


For example, consider this array:

[1,3,2,2,0,1]

When I am at position 1 I can jump to position 2. Once I am at position 2 I can jump to position 3, 4, or 5. The farthest I can jump from the second position is position 5, and the element at position 5 is 0, which essentially means I cannot jump any further from there.

The way to solve this problem is the peak and valley approach: here I am representing the array as a series of peaks and valleys.





 


If you look at the above image you can see the peaks and valleys. Once you are in the valley, that is the 0, you don't really have a way out of it. But you can actually jump from position 4 straight to position 6 and skip the valley entirely. This is the basic intuition behind the solution.

The basic steps are the following :

You create a variable called reachable, which holds the maximum index you can reach so far. So how do you calculate the value of reachable? It is:

Max(reachable, currentPosition+input[currentPosition]).

So the max reachable index will be the maximum of reachable or currentPosition+input[currentPosition]. 

So our job is just to calculate this at every position, and if at any point we realize that the current position is greater than reachable, we return false; that essentially means we are in a valley with no way to escape.

private boolean canJump(int[] nums) {
    int reachable = 0;                 // farthest index reachable so far
    int n = nums.length;
    for (int j = 0; j < n; j++) {
        if (j > reachable) {
            // index j is beyond the farthest point we can reach, so we are stuck in a valley
            return false;
        }
        reachable = Math.max(reachable, j + nums[j]);
    }
    return true;
}


In this approach we iterate over the array, and at each position we check whether we are at a reachable position or not. If we are not at a reachable position, we are stuck in a valley; if we are, we simply recalculate the reachable index as the maximum of the current reachable value and the current index plus the value at that index.
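To try the idea quickly, here is a small, self-contained driver for the sample array above (the class name and the standalone canJump helper are mine, added just for illustration):

public class JumpGameDemo {
    public static void main(String[] args) {
        int[] nums = {1, 3, 2, 2, 0, 1};
        System.out.println(canJump(nums));   // prints true: indices 0 -> 1 -> 3 -> 5 skip the 0 at index 4
    }

    // Same peak-and-valley scan described above.
    static boolean canJump(int[] nums) {
        int reachable = 0;
        for (int j = 0; j < nums.length; j++) {
            if (j > reachable) {
                return false;                // stuck in a valley we can never get past
            }
            reachable = Math.max(reachable, j + nums[j]);
        }
        return true;
    }
}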

Thursday, February 15, 2024

From Painful Tables to Performance Bliss: My Journey with Database Partitioning - Part II

Skewed workloads and Relieving Hot Spots



Imagine you have a library with books categorized by their first letter (A-Z). This is like partitioning data based on a key (like the first letter of a book title).

Problem: One letter (say, "X") becomes super popular (a celebrity author!). Everyone wants to read "X" books, causing a "hot spot" (overcrowding) in the "X" section.

Hashing doesn't fix it: Even if you assign different "buckets" based on a hash of the title, all "X" books still end up in the "X" bucket.

Solution 1: Split the key: Add a random number to each "X" book title (e.g., "X123", "X456"). This spreads them across different buckets, reducing the crowd in "X".

Drawback: Now you need to check all buckets with "X" to find all the books (more work for reading).

Solution 2 (future): Imagine the library magically adjusts shelves based on popularity, automatically spreading out the "X" books.

Takeaway: Choose the solution that best fits your needs. Splitting keys helps with hot spots but adds complexity. Future systems might handle this automatically.
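As a rough sketch of Solution 1 (the class, the choice of 10 sub-keys, and the "#" separator are my own assumptions for illustration), a hot key can be split by appending a random suffix on writes and fanning out on reads:

import java.util.concurrent.ThreadLocalRandom;

public class HotKeySplitter {
    private static final int SUB_KEYS = 10;   // assumption: 10 sub-keys is enough to spread the load

    // On write: append a random suffix so writes for the hot key land in different buckets/partitions.
    static String writeKey(String hotKey) {
        return hotKey + "#" + ThreadLocalRandom.current().nextInt(SUB_KEYS);   // e.g. "X#3"
    }

    // On read: the price we pay is checking every sub-key and merging the results.
    static String[] readKeys(String hotKey) {
        String[] keys = new String[SUB_KEYS];
        for (int i = 0; i < SUB_KEYS; i++) {
            keys[i] = hotKey + "#" + i;
        }
        return keys;
    }
}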


Partitioning and Secondary Indexes




A secondary index usually doesn't identify a record uniquely but rather is a way of searching for occurrences of a particular value: find all actions by user 123, find all articles containing the word hogwash, find all cars whose color is red. In each of these statements you can see we use a secondary index for the query.

Secondary indexes are essential for any database design. The issue with them is that they don't map neatly to partitions. This happens because of non-uniqueness: for example, "find all actions by user" really means "find all actions, then filter by user", so the "find all" part has to search every partition. This is a simple example of the problem.

There are two main approaches to partitioning a database with secondary indexes: document-based partitioning and term-based partitioning.


Partitioning Secondary Indexes by Document.




If you look at this image you can see that within each partition there is a separate secondary index. So if you want to search for red cars, you need to send the query to all partitions and combine all the results you get back. Notice that in this approach each partition maintains its own secondary index, and when you write you only need to deal with the partition that contains the document ID you are writing. For this exact reason a document-partitioned index is also known as a local index.

Querying this kind of partitioned database is known as scatter/gather, and it can make read queries on secondary indexes quite expensive. Even if you query the partitions in parallel, scatter/gather is prone to tail latency amplification. Nevertheless, it is a widely used approach.
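To make scatter/gather concrete, here is a minimal sketch (the Partition interface and its queryLocalIndex method are invented for illustration, not a real client API) that fans a secondary-index query out to every partition in parallel and merges the results:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical partition client: each partition only knows its own local index.
interface Partition {
    List<String> queryLocalIndex(String field, String value);   // e.g. ("color", "red")
}

public class ScatterGather {
    // Scatter the query to all partitions in parallel, then gather and merge the results.
    static List<String> findAll(List<Partition> partitions, String field, String value) {
        List<CompletableFuture<List<String>>> futures = new ArrayList<>();
        for (Partition p : partitions) {
            futures.add(CompletableFuture.supplyAsync(() -> p.queryLocalIndex(field, value)));
        }
        List<String> results = new ArrayList<>();
        for (CompletableFuture<List<String>> f : futures) {
            results.addAll(f.join());   // overall latency is set by the slowest partition: tail latency amplification
        }
        return results;
    }
}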

In the next blog we will learn more about partitioning from Martin Kleppmann's "Designing Data-Intensive Applications".




Sunday, February 11, 2024

From Painful Tables to Performance Bliss: My Journey with Database Partitioning - Part I

Ah, the early days of wrangling massive data tables! I vividly remember the struggle – slow queries, performance bottlenecks, and the ever-growing cloud bill. It was an uphill battle until we unearthed the magic bullet: database partitioning. Talk about a revelation! This newfound approach not only eradicated performance issues but also slashed computational costs.

But the story doesn't end there. My exploration revealed a treasure trove of partitioning techniques, each unlocking unique advantages. Inspired by "Designing Data-Intensive Applications", I embarked on a quest to master this data management superpower. This blog chronicles my learnings, shedding light on the various ways you can partition your data for optimal performance and cost-efficiency.


Every partition is a small database of its own, and each piece of data belongs to exactly one partition. The main reason for having these small databases, or partitions as we call them, is scalability: different partitions can be placed on different nodes in a shared-nothing cluster (a cluster in which the nodes are independent of each other).

Assume there is a query to fetch a row; the query will be performed by a node on its own partition. So to increase throughput, you just add more nodes.


Partitioning and Replication

Building on the previous statement, partitioning is usually combined with replication: even though each record belongs to exactly one partition, copies of that partition may still be stored on several nodes for the sake of fault tolerance, and a single node may store more than one partition.




Partitioning by Key Range


Real-World Example: Imagine millions of IoT devices, each with a unique IMEI number and timestamps for sensor readings. We can leverage key-range partitioning by using the date as the key. However, a potential drawback arises: all data within a single day would reside in the same partition, leaving others idle.

Dual Partitioning: To address this and distribute the write load more evenly, we can introduce dual partitioning. We'll use both the date and the IMEI number as keys. This ensures data gets distributed across multiple partitions based on both date and device, preventing overloading of single partitions.

Querying: When fetching data from multiple sensors within a specific time range, separate range queries will be needed for each IMEI number. However, the overall performance gain often outweighs this drawback due to the efficient retrieval within each partition.
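As a small illustration of the dual key (the exact key layout, the "#" separator, and the class below are assumptions of mine, not a prescription), the partition key can combine the date and the IMEI so that one day's readings for one device stay together and can be range-scanned per device:

import java.time.LocalDate;

public class SensorKey {
    // Composite key: date first, then IMEI, then the reading's timestamp.
    // Assumption: the store orders keys lexicographically, so all of one device's
    // readings for a given day form a contiguous, range-scannable block.
    static String partitionKey(LocalDate day, String imei, long timestampMillis) {
        return day + "#" + imei + "#" + timestampMillis;   // e.g. "2024-02-11#356938035643809#1707600000000"
    }

    // A time-range query across many devices becomes one range scan per IMEI.
    static String[] scanBoundsForDevice(LocalDate day, String imei) {
        return new String[] { day + "#" + imei + "#", day + "#" + imei + "#~" };   // [start, end) in key order
    }
}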

Personal Experience: This concept played a pivotal role in my career, saving the company millions. By implementing dual partitioning for range-based queries on massive IoT sensor data, we significantly improved performance and optimized resource utilization. This experience solidified the importance of understanding core concepts before designing any system.


Partitioning by Hash of Key

This is another very interesting partitioning concept. In contrast to key-range partitioning, we use a hash function to determine the partition of a given key. A good hash function takes skewed data and makes it uniformly distributed. Imagine you have a 32-bit hash function that takes a string. When you provide a new string, it returns a seemingly random number between 0 and (2 to the power of 32) - 1. Even if the input strings are very similar, their hashes are evenly distributed across that range of numbers.

Cassandra and MongoDB use MD5.

Voldemort uses the Fowler-Noll-Vo function.

Built-in hash functions of programming languages may not be suitable for data partitioning. For example, Java's hashCode() method can generate different hash values for the same key in different processes. In such cases, it's recommended to use separate hash implementations specifically designed for consistent key distribution across partitions. The figure below shows how partitioning by a well-chosen hash function actually works.


This is indeed a great technique for evenly distributing keys across partitions. The partition boundaries can be equally spaced or chosen pseudorandomly.
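Here is a minimal sketch of the idea (MD5 is chosen only because it is mentioned above for Cassandra and MongoDB; the modulo-N assignment is my simplification, since real systems typically assign ranges of hash values or use consistent hashing so that changing N does not move almost every key):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashPartitioner {
    private final int numPartitions;

    HashPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Assign a key to one of N partitions using a stable hash (not Object.hashCode()).
    int partitionFor(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            // Treat the digest as a non-negative integer and take it modulo the partition count.
            return new BigInteger(1, digest).mod(BigInteger.valueOf(numPartitions)).intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}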

Tuesday, February 6, 2024

Event Sourcing - Moving out of traditions | Simplified version

 Introduction

A colleague recommended Martin Kleppmann's "Designing Data-Intensive Applications" to me. Initially, I found the beginning somewhat tedious and opted for a non-linear approach, selecting topics of interest at random rather than reading from start to finish as one might with a novel. This strategy seemed fitting given the book's comprehensive coverage of software system design, akin to an engineering bible. Today, I've chosen to delve into the concept of Event Sourcing. Let's explore this topic together.


Why Event Sourcing?

  • Relevance: It's a critical concept in building scalable, resilient distributed systems and is widely applicable in modern software architectures, including microservices.
  • Foundational Knowledge: Understanding event sourcing will deepen your knowledge of how large-scale systems manage state and handle data changes over time.
  • Practical Application: It's highly relevant to real-world systems, particularly in scenarios requiring audit trails, historical data analysis, or complex business transactions.

Let's look at an Orders table schema below, written in the traditional approach.

Traditional Approach schema



Here :
  • order_id is a unique identifier for each order.
  • customer_id links to a customers table (not shown) that contains customer details.
  • order_date is the date and time the order was placed.
  • total_amount is the total cost of the order.
  • status could be values like 'Placed', 'Paid', 'Shipped', 'Delivered', etc.

Simple right..?

Obviously you have seen this kind of schema in your career. 100%, all of us have seen this.

Event Sourcing Approach Schema



And a simplified orders table to store order IDs




You are starting to get what event sourcing is now. But you're still doubtful, right? Well, everyone will be doubtful until they see some real data. So let's insert some dummy data into the traditional table and into the event sourcing table.




This will create four orders with different statuses in the orders table.


Now the event sourcing data.




In this example 

  • Order 1 goes through the full lifecycle from being placed to delivery.
  • Order 2 is placed and paid for but hasn't been shipped yet.
  • Order 3 is placed, paid for, and shipped, but not yet delivered.
  • Order 4 is only placed.

Event sourcing is much more descriptive when it comes to the historical events that have happened to an order. Imagine a bug occurred during the order placing flow; we can always replay this particular event and debug the system, instead of permanently losing that flow and saying "not reproducible," then waiting for the next order placement to occur. You know, the infamous "not reproducible" bug.
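As a rough sketch of that replay idea (the event type names and classes below are my own invention, not the exact schema from this post), the current state of an order is never stored directly; it is derived by folding over its events in order, which is also what lets us re-run the exact sequence that triggered a bug:

import java.util.List;

// Hypothetical event row: one entry per thing that happened to an order.
record OrderEvent(long orderId, String eventType) {}

class OrderState {
    String status = "NONE";

    // Rebuild the order's state by replaying its events in the order they occurred.
    static OrderState replay(List<OrderEvent> events) {
        OrderState state = new OrderState();
        for (OrderEvent e : events) {
            switch (e.eventType()) {
                case "ORDER_PLACED"     -> state.status = "Placed";
                case "PAYMENT_RECEIVED" -> state.status = "Paid";
                case "ORDER_SHIPPED"    -> state.status = "Shipped";
                case "ORDER_DELIVERED"  -> state.status = "Delivered";
                default -> { /* unknown event types are ignored in this sketch */ }
            }
        }
        return state;
    }
}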



The ideal places to apply this?


Considering the earlier discussion about the "not reproducible" bug, let's contemplate its implications in more critical scenarios. Would a bank ever be content with labeling a transactional glitch as "not reproducible" and leaving it at that? 


Or imagine placing an order on Amazon, only to encounter an error during packaging that erroneously signals a payment problem. Would such explanations be satisfactory?



Event sourcing is well-suited for systems that undergo a series of discrete events, each affecting the system's state.


This is just an introduction:


Please watch/read the following to understand more:


https://www.youtube.com/watch?v=MA_3fPBFBtg (about LinkedIn's use of Kafka)


The "Event Sourcing" section from Martin Kleppmann's "Designing Data-Intensive Applications"



So that's 10 Minutes in the morning!

Monday, February 5, 2024

Unlocking the Hidden Patterns: The Wonders of Binomial Coefficients and Pascal's Triangle

Understanding the Binomial Coefficient:


Imagine you have a group of 5 friends (let's name them A, B, C, D, and E), and you want to select 3 of them to form a team for a game. The binomial coefficient C(5, 3) tells you how many different teams of 3 you can form from the 5 friends.

Breaking Down the Formula:

The formula for the binomial coefficient is C(n, k) = n! / (k! × (n - k)!). Let's break it down using our example where n = 5 and k = 3:

  1. Factorials: The "!" symbol stands for factorial. For any positive integer x, x! (x factorial) is the product of all positive integers up to x. For example, 5! = 5 × 4 × 3 × 2 × 1 = 120.

  2. Applying the Formula: To find C(5, 3), we plug our numbers into the formula:

    C(5, 3) = 5! / (3! × (5 - 3)!) = 5! / (3! × 2!)

    Calculating each factorial:


    • 5! = 120
    • 3! = 6
    • 2! = 2

    So, C(5, 3) = 120 / (6 × 2) = 120 / 12 = 10.

    This means there are 10 different ways to choose 3 friends out of 5.

  3. Why Does the Formula Work?

     1. Starting with Permutations: Initially, if we were to line up 3 of your friends out of 5, not just choose them, we'd have to consider the permutations (different arrangements). For the first friend you have 5 choices, for the second you have 4 (since one friend is already chosen), and for the third, 3 choices are left. So, for permutations, you multiply these: 5 × 4 × 3 = 60.

     2. Adjusting for Order Not Mattering: The 60 permutations include different orders of the same 3 friends (like ABC, ACB, BAC...). But in our case, the order doesn't matter; ABC is the same team as BAC. Each group of 3 friends can be arranged in 3! = 6 ways (3 options for the first position, 2 for the second, and 1 for the last), so we've counted each group 6 times in our 60 permutations.

     3. Getting to Combinations: To correct for this and count each unique group once, we divide the number of permutations by the number of ways to arrange 3 friends: 60 / 6 = 10. This matches our calculation using the binomial coefficient formula, showing that there are 10 unique ways to choose 3 friends out of 5.

  4. Conclusion:

     The binomial coefficient formula C(n, k) gives us a way to calculate the number of unique groups (combinations) of k items we can choose from a larger set of n items, without caring about the order of selection. It corrects for the overcounting of arrangements (permutations) by dividing by the factorial of the group size, k!, ensuring each unique group is only counted once.
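As a tiny sanity check of the formula, here is a sketch that translates it directly into code (the plain factorial version overflows quickly, so this is fine for small n like our example but not for large inputs):

public class BinomialCoefficient {
    static long factorial(int x) {
        long result = 1;
        for (int i = 2; i <= x; i++) {
            result *= i;
        }
        return result;
    }

    // Direct translation of C(n, k) = n! / (k! × (n - k)!).
    static long choose(int n, int k) {
        return factorial(n) / (factorial(k) * factorial(n - k));
    }

    public static void main(String[] args) {
        System.out.println(choose(5, 3));   // 10 ways to pick 3 friends out of 5
    }
}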

Now let's get into the most interesting part: Pascal's triangle and how to create it with binomial coefficients.

 

First, let's start by writing a small program to print Pascal's triangle up to the nth row.

The idea behind this program is pretty simple: a List<List<Integer>> basically represents a matrix, a list of rows where each row is a list of values.

Each row's first and last value will be 1, which we hard-code. Then we iterate over numRows and construct each row. One small thing to notice here is the inner for loop: it has to run from 1 up to (but not including) rowNum, so it runs rowNum - 1 times. That means it does nothing for the first two rows and then fills in the middle values (1, 2, 3, ... of them) for each subsequent row, as shown in the sketch below.
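The original screenshot of the program is not reproduced here, so below is a minimal sketch of exactly what the paragraph above describes (the names are my own): hard-code the first and last 1 of each row, and let the inner loop fill the middle values from the previous row.

import java.util.ArrayList;
import java.util.List;

public class PascalsTriangle {
    static List<List<Integer>> generate(int numRows) {
        List<List<Integer>> triangle = new ArrayList<>();
        for (int rowNum = 0; rowNum < numRows; rowNum++) {
            List<Integer> row = new ArrayList<>();
            row.add(1);                                    // first value of every row is 1
            for (int j = 1; j < rowNum; j++) {             // fills the middle values; empty for the first two rows
                List<Integer> prev = triangle.get(rowNum - 1);
                row.add(prev.get(j - 1) + prev.get(j));    // each middle value is the sum of the two values above it
            }
            if (rowNum > 0) {
                row.add(1);                                // last value of every row (after the first) is 1
            }
            triangle.add(row);
        }
        return triangle;
    }

    public static void main(String[] args) {
        generate(5).forEach(System.out::println);          // [1], [1, 1], [1, 2, 1], [1, 3, 3, 1], [1, 4, 6, 4, 1]
    }
}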

Let's move forward to understand the relation between the binomial coefficient and Pascal's triangle. If you look closely you can find something really interesting. Below is a representation of Pascal's triangle.


If you have not caught it yet, see below, which unlocks the hidden pattern relating the binomial coefficient and Pascal's triangle: the value at row n, position r of the triangle (both 1-indexed) is simply the binomial coefficient C(n - 1, r - 1).

With this we now understand the relation between Pascal's triangle and the binomial coefficient. Great, now even a very dumb person like me can easily understand what Striver is saying in his video. The below program makes a lot of sense now: it shows how to calculate the value at a particular position of Pascal's triangle.
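The screenshot of that program is also not reproduced here, so here is a minimal sketch of the standard trick it refers to (names are mine): the value at row r, position c of Pascal's triangle (both 1-indexed) is C(r - 1, c - 1), computed iteratively so the intermediate numbers stay small and exact.

public class PascalValue {
    // Value at row r, position c of Pascal's triangle (both 1-indexed) = C(r - 1, c - 1).
    static long valueAt(int r, int c) {
        int n = r - 1;
        int k = c - 1;
        long result = 1;
        for (int i = 0; i < k; i++) {
            result = result * (n - i) / (i + 1);   // result is C(n, i) here, so the division is exact
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(valueAt(5, 3));   // 3rd value in the 5th row -> C(4, 2) = 6
    }
}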



If you have reached this far in the blog then great, you are up to something special in your life. Stay tuned for the next blog.

