2000 | 2000 | "SquareNumbers(lit(1), lit(3)).show()" |
2001 | 2001 | ] |
2002 | 2002 | }, |
| 2003 | + { |
| 2004 | + "cell_type": "markdown", |
| 2005 | + "id": "216e9fc0-12a9-4f45-85b1-8e791755b1d3", |
| 2006 | + "metadata": {}, |
| 2007 | + "source": [ |
| 2008 | + "### Best Practices for PySpark DataFrame Comparison Testing" |
| 2009 | + ] |
| 2010 | + }, |
| 2011 | + { |
| 2012 | + "cell_type": "code", |
| 2013 | + "execution_count": null, |
| 2014 | + "id": "9badb0ee-16ec-4291-9477-8a38ebd7e876", |
| 2015 | + "metadata": { |
| 2016 | + "editable": true, |
| 2017 | + "slideshow": { |
| 2018 | + "slide_type": "" |
| 2019 | + }, |
| 2020 | + "tags": [ |
| 2021 | + "hide-cell" |
| 2022 | + ] |
| 2023 | + }, |
| 2024 | + "outputs": [], |
| 2025 | + "source": [ |
| 2026 | + "!pip install \"pyspark[sql]\"" |
| 2027 | + ] |
| 2028 | + }, |
| 2029 | + { |
| 2030 | + "cell_type": "code", |
| 2031 | + "execution_count": 19, |
| 2032 | + "id": "d2adcd65-5197-404f-88d6-c368a863cf75", |
| 2033 | + "metadata": { |
| 2034 | + "editable": true, |
| 2035 | + "slideshow": { |
| 2036 | + "slide_type": "" |
| 2037 | + }, |
| 2038 | + "tags": [ |
| 2039 | + "hide-cell" |
| 2040 | + ] |
| 2041 | + }, |
| 2042 | + "outputs": [], |
| 2043 | + "source": [ |
| 2044 | + "from pyspark.sql import SparkSession\n", |
| 2045 | + "\n", |
| 2046 | + "# Create SparkSession\n", |
| 2047 | + "spark = SparkSession.builder.getOrCreate()" |
| 2048 | + ] |
| 2049 | + }, |
| 2050 | + { |
| 2051 | + "cell_type": "markdown", |
| 2052 | + "id": "1002536a", |
| 2053 | + "metadata": {}, |
| 2054 | + "source": [ |
| 2055 | + "Manually comparing PySpark DataFrame outputs using `collect()` and equality comparison leads to brittle tests due to ordering issues and unclear error messages when data doesn't match expectations.\n", |
| 2056 | + "\n", |
| 2057 | + "For example, the following test will fail due to ordering issues, resulting in an unclear error message.\n" |
| 2058 | + ] |
| 2059 | + }, |
| 2060 | + { |
| 2061 | + "cell_type": "code", |
| 2062 | + "execution_count": 31, |
| 2063 | + "id": "e4299f30", |
| 2064 | + "metadata": {}, |
| 2065 | + "outputs": [ |
| 2066 | + { |
| 2067 | + "name": "stderr", |
| 2068 | + "output_type": "stream", |
| 2069 | + "text": [ |
| 2070 | + " \r" |
| 2071 | + ] |
| 2072 | + }, |
| 2073 | + { |
| 2074 | + "name": "stdout", |
| 2075 | + "output_type": "stream", |
| 2076 | + "text": [ |
| 2077 | + "assert [Row(id=1, name='Alice', value=100), Row(id=2, name='Bob', value=200)] == [Row(id=2, name='Bob', value=200), Row(id=1, name='Alice', value=100)]\n", |
| 2078 | + " + where [Row(id=1, name='Alice', value=100), Row(id=2, name='Bob', value=200)] = <bound method DataFrame.collect of DataFrame[id: bigint, name: string, value: bigint]>()\n", |
| 2079 | + " + where <bound method DataFrame.collect of DataFrame[id: bigint, name: string, value: bigint]> = DataFrame[id: bigint, name: string, value: bigint].collect\n", |
| 2080 | + " + and [Row(id=2, name='Bob', value=200), Row(id=1, name='Alice', value=100)] = <bound method DataFrame.collect of DataFrame[id: bigint, name: string, value: bigint]>()\n", |
| 2081 | + " + where <bound method DataFrame.collect of DataFrame[id: bigint, name: string, value: bigint]> = DataFrame[id: bigint, name: string, value: bigint].collect\n" |
| 2082 | + ] |
| 2083 | + } |
| 2084 | + ], |
| 2085 | + "source": [ |
| 2086 | + "# Manual DataFrame comparison\n", |
| 2087 | + "result_df = spark.createDataFrame(\n", |
| 2088 | + " [(1, \"Alice\", 100), (2, \"Bob\", 200)], [\"id\", \"name\", \"value\"]\n", |
| 2089 | + ")\n", |
| 2090 | + "\n", |
| 2091 | + "expected_df = spark.createDataFrame(\n", |
| 2092 | + " [(2, \"Bob\", 200), (1, \"Alice\", 100)], [\"id\", \"name\", \"value\"]\n", |
| 2093 | + ")\n", |
| 2094 | + "\n", |
| 2095 | + "try:\n", |
| 2096 | + " assert result_df.collect() == expected_df.collect()\n", |
| 2097 | + "except AssertionError as e:\n", |
| 2098 | + " print(e)" |
| 2099 | + ] |
| 2100 | + }, |
| 2101 | + { |
| 2102 | + "cell_type": "markdown", |
| 2103 | + "id": "7c4f8fd8-c2c2-4804-8e42-6fd3eb6aec27", |
| 2104 | + "metadata": {}, |
| 2105 | + "source": [ |
| 2106 | + "`assertDataFrameEqual` provides a robust way to compare DataFrames, allowing for order-independent comparison.\n" |
| 2107 | + ] |
| 2108 | + }, |
| 2109 | + { |
| 2110 | + "cell_type": "code", |
| 2111 | + "execution_count": null, |
| 2112 | + "id": "73b0d483-8b00-44ab-9279-4c7765ca1ff6", |
| 2113 | + "metadata": {}, |
| 2114 | + "outputs": [], |
| 2115 | + "source": [ |
| 2116 | + "# Testing with DataFrame equality\n", |
| 2117 | + "from pyspark.testing.utils import assertDataFrameEqual" |
| 2118 | + ] |
| 2119 | + }, |
| 2120 | + { |
| 2121 | + "cell_type": "code", |
| 2122 | + "execution_count": 7, |
| 2123 | + "id": "7c46ae8a", |
| 2124 | + "metadata": {}, |
| 2125 | + "outputs": [], |
| 2126 | + "source": [ |
| 2127 | + "assertDataFrameEqual(result_df, expected_df)" |
| 2128 | + ] |
| 2129 | + }, |
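| | + {
| | + "cell_type": "markdown",
| | + "id": "added-checkroworder-note",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch, assuming PySpark 3.5 or later: `assertDataFrameEqual` also accepts a `checkRowOrder` flag to enforce row order, plus `rtol`/`atol` tolerances for approximate numeric comparison. With `checkRowOrder=True`, the two DataFrames above no longer compare equal."
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "id": "added-checkroworder-demo",
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "# Sketch (assumes PySpark 3.5+): enforce row order explicitly\n",
| | + "try:\n",
| | + "    assertDataFrameEqual(result_df, expected_df, checkRowOrder=True)\n",
| | + "except AssertionError as e:\n",
| | + "    print(e)  # rows match, but their order differs"
| | + ]
| | + },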
| 2130 | + { |
| 2131 | + "cell_type": "markdown", |
| 2132 | + "id": "085f150d-20ff-4b0a-a4ab-1ee452598e9e", |
| 2133 | + "metadata": {}, |
| 2134 | + "source": [ |
| 2135 | + "Using `collect()` for comparison cannot detect type mismatch, whereas `assertDataFrameEqual` can.\n", |
| 2136 | + "\n", |
| 2137 | + "For example, the following test will pass, even though there is a type mismatch.\n" |
| 2138 | + ] |
| 2139 | + }, |
| 2140 | + { |
| 2141 | + "cell_type": "code", |
| 2142 | + "execution_count": 27, |
| 2143 | + "id": "da7494c0-c05f-4a2f-a411-805c8f2f73ba", |
| 2144 | + "metadata": {}, |
| 2145 | + "outputs": [], |
| 2146 | + "source": [ |
| 2147 | + "# Manual DataFrame comparison\n", |
| 2148 | + "result_df = spark.createDataFrame(\n", |
| 2149 | + " [(1, \"Alice\", 100), (2, \"Bob\", 200)], [\"id\", \"name\", \"value\"]\n", |
| 2150 | + ")\n", |
| 2151 | + "\n", |
| 2152 | + "expected_df = spark.createDataFrame(\n", |
| 2153 | + " [(1, \"Alice\", 100.0), (2, \"Bob\", 200.0)], [\"id\", \"name\", \"value\"]\n", |
| 2154 | + ")\n", |
| 2155 | + "\n", |
| 2156 | + "assert result_df.collect() == expected_df.collect()" |
| 2157 | + ] |
| 2158 | + }, |
| 2159 | + { |
| 2160 | + "cell_type": "markdown", |
| 2161 | + "id": "82914b3b-69c0-4c68-9d72-d2ce31417397", |
| 2162 | + "metadata": {}, |
| 2163 | + "source": [ |
| 2164 | + "The error message produced by `assertDataFrameEqual` is clear and informative, highlighting the difference in schemas." |
| 2165 | + ] |
| 2166 | + }, |
| 2167 | + { |
| 2168 | + "cell_type": "code", |
| 2169 | + "execution_count": 30, |
| 2170 | + "id": "3faa1dbc-887a-4c36-ace8-c621411c3fb7", |
| 2171 | + "metadata": {}, |
| 2172 | + "outputs": [ |
| 2173 | + { |
| 2174 | + "name": "stdout", |
| 2175 | + "output_type": "stream", |
| 2176 | + "text": [ |
| 2177 | + "[DIFFERENT_SCHEMA] Schemas do not match.\n", |
| 2178 | + "--- actual\n", |
| 2179 | + "+++ expected\n", |
| 2180 | + "- StructType([StructField('id', LongType(), True), StructField('name', StringType(), True), StructField('value', LongType(), True)])\n", |
| 2181 | + "? ^ ^^\n", |
| 2182 | + "\n", |
| 2183 | + "+ StructType([StructField('id', LongType(), True), StructField('name', StringType(), True), StructField('value', DoubleType(), True)])\n", |
| 2184 | + "? ^ ^^^^\n", |
| 2185 | + "\n" |
| 2186 | + ] |
| 2187 | + } |
| 2188 | + ], |
| 2189 | + "source": [ |
| 2190 | + "try:\n", |
| 2191 | + " assertDataFrameEqual(result_df, expected_df)\n", |
| 2192 | + "except AssertionError as e:\n", |
| 2193 | + " print(e)" |
| 2194 | + ] |
| 2195 | + }, |
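| | + {
| | + "cell_type": "markdown",
| | + "id": "added-assertschemaequal-note",
| | + "metadata": {},
| | + "source": [
| | + "A minimal sketch, assuming PySpark 3.5 or later: when only the schema matters, `pyspark.testing.utils` also exposes `assertSchemaEqual`, which compares two `StructType` objects directly and reports the same kind of readable diff."
| | + ]
| | + },
| | + {
| | + "cell_type": "code",
| | + "execution_count": null,
| | + "id": "added-assertschemaequal-demo",
| | + "metadata": {},
| | + "outputs": [],
| | + "source": [
| | + "# Sketch (assumes PySpark 3.5+): compare only the schemas\n",
| | + "from pyspark.testing.utils import assertSchemaEqual\n",
| | + "\n",
| | + "try:\n",
| | + "    assertSchemaEqual(result_df.schema, expected_df.schema)\n",
| | + "except AssertionError as e:\n",
| | + "    print(e)  # value column: LongType vs DoubleType"
| | + ]
| | + },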
2003 | 2196 | { |
2004 | 2197 | "cell_type": "markdown", |
2005 | 2198 | "id": "9da7e800", |