|
2 | 2 | "cells": [ |
3 | 3 | { |
4 | 4 | "cell_type": "markdown", |
5 | | - "id": "8ea79679", |
| 5 | + "id": "988c67a4", |
6 | 6 | "metadata": {}, |
7 | 7 | "source": [ |
8 | 8 | "## Manage Data" |
9 | 9 | ] |
10 | 10 | }, |
11 | 11 | { |
12 | 12 | "cell_type": "markdown", |
13 | | - "id": "17975d80", |
| 13 | + "id": "7aefb1e9", |
14 | 14 | "metadata": {}, |
15 | 15 | "source": [ |
16 | 16 | "This section covers some tools to work with your data." |
17 | 17 | ] |
18 | 18 | }, |
19 | 19 | { |
20 | 20 | "cell_type": "markdown", |
21 | | - "id": "f3edb931", |
| 21 | + "id": "ae5bd151", |
22 | 22 | "metadata": {}, |
23 | 23 | "source": [ |
24 | 24 | "### DVC: A Data Version Control Tool for Your Data Science Projects" |
|
27 | 27 | { |
28 | 28 | "cell_type": "code", |
29 | 29 | "execution_count": null, |
30 | | - "id": "0e816bc8", |
| 30 | + "id": "7647c646", |
31 | 31 | "metadata": { |
32 | 32 | "tags": [ |
33 | 33 | "hide-cell" |
|
40 | 40 | }, |
41 | 41 | { |
42 | 42 | "cell_type": "markdown", |
43 | | - "id": "a4452b14", |
| 43 | + "id": "8a2e7f56", |
44 | 44 | "metadata": {}, |
45 | 45 | "source": [ |
46 | 46 | "While Git excels at versioning code, managing data versions can be tricky. DVC (Data Version Control) bridges this gap by allowing you to track data changes alongside your code, while keeping the actual data separate. It's like Git for data.\n", |
|
50 | 50 | }, |
51 | 51 | { |
52 | 52 | "cell_type": "markdown", |
53 | | - "id": "5e2dc416", |
| 53 | + "id": "f34b0247", |
54 | 54 | "metadata": {}, |
55 | 55 | "source": [ |
56 | 56 | "```bash\n", |
|
79 | 79 | }, |
80 | 80 | { |
81 | 81 | "cell_type": "markdown", |
82 | | - "id": "7e314b83", |
| 82 | + "id": "925a11c6", |
83 | 83 | "metadata": {}, |
84 | 84 | "source": [ |
85 | 85 | "[Link to DVC](https://dvc.org/)" |
86 | 86 | ] |
87 | 87 | }, |
88 | 88 | { |
89 | 89 | "cell_type": "markdown", |
90 | | - "id": "ca04203e", |
| 90 | + "id": "55a8e060", |
91 | 91 | "metadata": {}, |
92 | 92 | "source": [ |
93 | 93 | "### sweetviz: Compare Similar Features Between 2 Datasets" |
|
96 | 96 | { |
97 | 97 | "cell_type": "code", |
98 | 98 | "execution_count": null, |
99 | | - "id": "fc18afd3", |
| 99 | + "id": "924c4128", |
100 | 100 | "metadata": { |
101 | 101 | "tags": [ |
102 | 102 | "hide-cell" |
|
109 | 109 | }, |
110 | 110 | { |
111 | 111 | "cell_type": "markdown", |
112 | | - "id": "7e5eb4eb", |
| 112 | + "id": "5735e44b", |
113 | 113 | "metadata": {}, |
114 | 114 | "source": [ |
115 | 115 | "When comparing datasets, such as training and testing sets, sweetviz helps visualize similarities and differences with ease.\n", |
|
120 | 120 | { |
121 | 121 | "cell_type": "code", |
122 | 122 | "execution_count": null, |
123 | | - "id": "3befe045", |
| 123 | + "id": "19d669cb", |
124 | 124 | "metadata": {}, |
125 | 125 | "outputs": [], |
126 | 126 | "source": [ |
|
137 | 137 | }, |
138 | 138 | { |
139 | 139 | "cell_type": "markdown", |
140 | | - "id": "a9e3f264", |
| 140 | + "id": "e3d5ce0e", |
141 | 141 | "metadata": {}, |
142 | 142 | "source": [ |
143 | 143 | "" |
144 | 144 | ] |
145 | 145 | }, |
146 | 146 | { |
147 | 147 | "cell_type": "markdown", |
148 | | - "id": "e6035f7e", |
| 148 | + "id": "ef7ea0d9", |
149 | 149 | "metadata": {}, |
150 | 150 | "source": [ |
151 | 151 | "[Link to sweetviz](https://github.com/fbdesignpro/sweetviz)" |
152 | 152 | ] |
153 | 153 | }, |
154 | 154 | { |
155 | 155 | "cell_type": "markdown", |
156 | | - "id": "c9f7d411", |
| 156 | + "id": "a7417c5d", |
157 | 157 | "metadata": {}, |
158 | 158 | "source": [ |
159 | 159 | "### quadratic: Data Science Spreadsheet with Python and SQL\n", |
|
167 | 167 | }, |
168 | 168 | { |
169 | 169 | "cell_type": "markdown", |
170 | | - "id": "af81e694", |
| 170 | + "id": "2bf92ed6", |
171 | 171 | "metadata": {}, |
172 | 172 | "source": [ |
173 | 173 | "### whylogs: Data Logging Made Easy" |
|
176 | 176 | { |
177 | 177 | "cell_type": "code", |
178 | 178 | "execution_count": null, |
179 | | - "id": "44f003ec", |
| 179 | + "id": "baa5966b", |
180 | 180 | "metadata": { |
181 | 181 | "tags": [ |
182 | 182 | "hide-cell" |
|
189 | 189 | }, |
190 | 190 | { |
191 | 191 | "cell_type": "markdown", |
192 | | - "id": "515fed97", |
| 192 | + "id": "d4c3afb5", |
193 | 193 | "metadata": {}, |
194 | 194 | "source": [ |
195 | 195 | "Keeping track of dataset statistics is crucial for data quality and monitoring. whylogs makes logging dataset summaries straightforward.\n", |
|
200 | 200 | { |
201 | 201 | "cell_type": "code", |
202 | 202 | "execution_count": null, |
203 | | - "id": "d6da1d5f", |
| 203 | + "id": "01e06b51", |
204 | 204 | "metadata": {}, |
205 | 205 | "outputs": [], |
206 | 206 | "source": [ |
|
227 | 227 | { |
228 | 228 | "cell_type": "code", |
229 | 229 | "execution_count": null, |
230 | | - "id": "05a506dd", |
| 230 | + "id": "301dc0bd", |
231 | 231 | "metadata": {}, |
232 | 232 | "outputs": [], |
233 | 233 | "source": [ |
|
237 | 237 | { |
238 | 238 | "cell_type": "code", |
239 | 239 | "execution_count": null, |
240 | | - "id": "741bf81a", |
| 240 | + "id": "e210ef80", |
241 | 241 | "metadata": {}, |
242 | 242 | "outputs": [], |
243 | 243 | "source": [ |
|
246 | 246 | }, |
247 | 247 | { |
248 | 248 | "cell_type": "markdown", |
249 | | - "id": "b2a362a1", |
| 249 | + "id": "e66eacfd", |
250 | 250 | "metadata": {}, |
251 | 251 | "source": [ |
252 | 252 | "[Link to whylogs](https://github.com/whylabs/whylogs)." |
253 | 253 | ] |
254 | 254 | }, |
255 | 255 | { |
256 | 256 | "cell_type": "markdown", |
257 | | - "id": "08a8d247", |
| 257 | + "id": "9e79ab66", |
258 | 258 | "metadata": {}, |
259 | 259 | "source": [ |
260 | 260 | "### Fluke: The Easiest Way to Move Data Around" |
261 | 261 | ] |
262 | 262 | }, |
263 | 263 | { |
264 | 264 | "cell_type": "markdown", |
265 | | - "id": "e3e3b948", |
| 265 | + "id": "cf77769e", |
266 | 266 | "metadata": {}, |
267 | 267 | "source": [ |
268 | 268 | "Transferring data between locations—such as from a remote server to cloud storage—can be cumbersome, especially with Python libraries that involve complex HTTP/SSH connections and directory handling. \n", |
|
274 | 274 | }, |
275 | 275 | { |
276 | 276 | "cell_type": "markdown", |
277 | | - "id": "5d914362", |
| 277 | + "id": "b8321eb0", |
278 | 278 | "metadata": {}, |
279 | 279 | "source": [ |
280 | 280 | "```python\n", |
|
297 | 297 | }, |
298 | 298 | { |
299 | 299 | "cell_type": "markdown", |
300 | | - "id": "553b2fe5", |
| 300 | + "id": "7bdf6158", |
301 | 301 | "metadata": {}, |
302 | 302 | "source": [ |
303 | 303 | "```python\n", |
|
313 | 313 | }, |
314 | 314 | { |
315 | 315 | "cell_type": "markdown", |
316 | | - "id": "dd7831e3", |
| 316 | + "id": "59acab4e", |
317 | 317 | "metadata": {}, |
318 | 318 | "source": [ |
319 | 319 | "[Link to Fluke](https://github.com/manoss96/fluke)." |
320 | 320 | ] |
321 | 321 | }, |
322 | 322 | { |
323 | 323 | "cell_type": "markdown", |
324 | | - "id": "000f015b", |
| 324 | + "id": "d3636ffa", |
325 | 325 | "metadata": {}, |
326 | 326 | "source": [ |
327 | 327 | "### safetensors: A Simple and Safe Way to Store and Distribute Tensors" |
|
330 | 330 | { |
331 | 331 | "cell_type": "code", |
332 | 332 | "execution_count": null, |
333 | | - "id": "57f18677", |
| 333 | + "id": "35a7b733", |
334 | 334 | "metadata": { |
335 | 335 | "tags": [ |
336 | 336 | "hide-cell" |
|
343 | 343 | }, |
344 | 344 | { |
345 | 345 | "cell_type": "markdown", |
346 | | - "id": "0d21ea72", |
| 346 | + "id": "2a139f90", |
347 | 347 | "metadata": {}, |
348 | 348 | "source": [ |
349 | 349 | "PyTorch defaults to using pickle for tensor storage, which poses security risks because malicious pickle files can execute arbitrary code upon unpickling. In contrast, safetensors specializes in securely storing tensors, guaranteeing data integrity during storage and retrieval. \n", |
|
354 | 354 | { |
355 | 355 | "cell_type": "code", |
356 | 356 | "execution_count": null, |
357 | | - "id": "9011bd3b", |
| 357 | + "id": "fe38da49", |
358 | 358 | "metadata": { |
359 | 359 | "editable": true, |
360 | 360 | "slideshow": { |
|
381 | 381 | }, |
382 | 382 | { |
383 | 383 | "cell_type": "markdown", |
384 | | - "id": "36685b4e", |
| 384 | + "id": "79304249", |
385 | 385 | "metadata": {}, |
386 | 386 | "source": [ |
387 | 387 | "[Link to safetensors](https://bit.ly/3vqzbhl)." |
388 | 388 | ] |
389 | 389 | }, |
390 | 390 | { |
391 | 391 | "cell_type": "markdown", |
392 | | - "id": "a85ca417", |
| 392 | + "id": "ef96faf4", |
393 | 393 | "metadata": {}, |
394 | 394 | "source": [ |
395 | 395 | "### datacompy: Smart Data Comparison Made Simple" |
|
398 | 398 | { |
399 | 399 | "cell_type": "code", |
400 | 400 | "execution_count": null, |
401 | | - "id": "eed29606", |
| 401 | + "id": "0db96d3d", |
402 | 402 | "metadata": { |
403 | 403 | "editable": true, |
404 | 404 | "slideshow": { |
|
415 | 415 | }, |
416 | 416 | { |
417 | 417 | "cell_type": "markdown", |
418 | | - "id": "8c084d13", |
| 418 | + "id": "c578fa05", |
419 | 419 | "metadata": {}, |
420 | 420 | "source": [ |
421 | 421 | "Data analysts and data engineers often struggle with comparing two datasets. This results in writing complex code to compare values, identify mismatches, and generate comparison reports." |
|
424 | 424 | { |
425 | 425 | "cell_type": "code", |
426 | 426 | "execution_count": null, |
427 | | - "id": "fc084d3e", |
| 427 | + "id": "17601c72", |
428 | 428 | "metadata": {}, |
429 | 429 | "outputs": [], |
430 | 430 | "source": [ |
|
447 | 447 | { |
448 | 448 | "cell_type": "code", |
449 | 449 | "execution_count": null, |
450 | | - "id": "9807d701", |
| 450 | + "id": "dcb8aa9e", |
451 | 451 | "metadata": {}, |
452 | 452 | "outputs": [], |
453 | 453 | "source": [ |
|
466 | 466 | { |
467 | 467 | "cell_type": "code", |
468 | 468 | "execution_count": null, |
469 | | - "id": "9efe2035", |
| 469 | + "id": "68bbbe78", |
470 | 470 | "metadata": {}, |
471 | 471 | "outputs": [], |
472 | 472 | "source": [ |
|
488 | 488 | }, |
489 | 489 | { |
490 | 490 | "cell_type": "markdown", |
491 | | - "id": "23942bf8", |
| 491 | + "id": "66517fc1", |
492 | 492 | "metadata": {}, |
493 | 493 | "source": [ |
494 | 494 | "With datacompy, you can easily compare datasets and get detailed reports about differences, including matching percentage, column-level comparison, and sample mismatches. You can use it with various data frameworks like Pandas, Spark, Polars, and Snowflake." |
|
497 | 497 | { |
498 | 498 | "cell_type": "code", |
499 | 499 | "execution_count": null, |
500 | | - "id": "6c20e21a", |
| 500 | + "id": "0aa3e7cb", |
501 | 501 | "metadata": {}, |
502 | 502 | "outputs": [], |
503 | 503 | "source": [ |
|
509 | 509 | { |
510 | 510 | "cell_type": "code", |
511 | 511 | "execution_count": null, |
512 | | - "id": "c67afacd", |
| 512 | + "id": "aba82c40", |
513 | 513 | "metadata": {}, |
514 | 514 | "outputs": [], |
515 | 515 | "source": [ |
|
518 | 518 | }, |
519 | 519 | { |
520 | 520 | "cell_type": "markdown", |
521 | | - "id": "5dfa9a3f", |
| 521 | + "id": "2a5eec50", |
522 | 522 | "metadata": {}, |
523 | 523 | "source": [ |
524 | 524 | "[Link to datacompy](https://github.com/capitalone/datacompy)." |
|