Commit 864a6c3

Update documentation
1 parent db49d9b commit 864a6c3

8 files changed

+406
-409
lines changed

README.html

Lines changed: 1 addition & 2 deletions
@@ -498,8 +498,7 @@ <h2> Contents </h2>
<div id="searchbox"></div>
<article class="bd-article" role="main">

- <p><a class="reference external" href="https://codecut.ai/?utm_source=github&amp;utm_medium=efficient_python_tricks&amp;utm_campaign=github_banner"><img src="img/codecut.jpg"></a></p>
- <div align="center">
+ <div align="center">
<h1 align="center">
Efficient Python Tricks and Tools for Data Scientists
</h3>

_sources/Chapter5/manage_data.ipynb

Lines changed: 41 additions & 41 deletions
@@ -2,23 +2,23 @@
"cells": [
{
"cell_type": "markdown",
- "id": "8ea79679",
+ "id": "988c67a4",
"metadata": {},
"source": [
"## Manage Data"
]
},
{
"cell_type": "markdown",
- "id": "17975d80",
+ "id": "7aefb1e9",
"metadata": {},
"source": [
"This section covers some tools to work with your data."
]
},
{
"cell_type": "markdown",
- "id": "f3edb931",
+ "id": "ae5bd151",
"metadata": {},
"source": [
"### DVC: A Data Version Control Tool for Your Data Science Projects"
@@ -27,7 +27,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "0e816bc8",
+ "id": "7647c646",
"metadata": {
"tags": [
"hide-cell"
@@ -40,7 +40,7 @@
},
{
"cell_type": "markdown",
- "id": "a4452b14",
+ "id": "8a2e7f56",
"metadata": {},
"source": [
"While Git excels at versioning code, managing data versions can be tricky. DVC (Data Version Control) bridges this gap by allowing you to track data changes alongside your code, while keeping the actual data separate. It's like Git for data.\n",
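Note on the cell above: the idea of tracking data changes alongside code while keeping the data itself separate can be sketched with the standard library. This is only an illustration of the principle, not DVC's implementation or file format; `write_pointer` and the `.pointer.json` suffix are hypothetical names chosen for this example. A small pointer file holding a content hash is what would be committed to Git, while the data file stays out of the repository.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_pointer(data_path: Path) -> Path:
    """Hash a data file and write a small pointer file next to it (hypothetical helper)."""
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    pointer_path = data_path.parent / (data_path.name + ".pointer.json")
    pointer = {
        "path": data_path.name,
        "sha256": digest,
        "size": data_path.stat().st_size,
    }
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "data.csv"

    # Version 1 of the data: the pointer records its hash.
    data_file.write_text("id,value\n1,10\n")
    first = json.loads(write_pointer(data_file).read_text())

    # Version 2 of the data: the pointer changes, so a code repository
    # that tracks only the pointer still sees that the data changed.
    data_file.write_text("id,value\n1,10\n2,20\n")
    second = json.loads(write_pointer(data_file).read_text())

changed = first["sha256"] != second["sha256"]
print(changed)
```

Only the pointer needs to live next to the code; where the data is stored is a separate decision.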
@@ -50,7 +50,7 @@
},
{
"cell_type": "markdown",
- "id": "5e2dc416",
+ "id": "f34b0247",
"metadata": {},
"source": [
"```bash\n",
@@ -79,15 +79,15 @@
},
{
"cell_type": "markdown",
- "id": "7e314b83",
+ "id": "925a11c6",
"metadata": {},
"source": [
"[Link to DVC](https://dvc.org/)"
]
},
{
"cell_type": "markdown",
- "id": "ca04203e",
+ "id": "55a8e060",
"metadata": {},
"source": [
"### sweetviz: Compare the similar features between 2 different datasets"
@@ -96,7 +96,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "fc18afd3",
+ "id": "924c4128",
"metadata": {
"tags": [
"hide-cell"
@@ -109,7 +109,7 @@
},
{
"cell_type": "markdown",
- "id": "7e5eb4eb",
+ "id": "5735e44b",
"metadata": {},
"source": [
"When comparing datasets, such as training and testing sets, sweetviz helps visualize similarities and differences with ease.\n",
@@ -120,7 +120,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "3befe045",
+ "id": "19d669cb",
"metadata": {},
"outputs": [],
"source": [
@@ -137,23 +137,23 @@
},
{
"cell_type": "markdown",
- "id": "a9e3f264",
+ "id": "e3d5ce0e",
"metadata": {},
"source": [
"![image](../img/sweetviz_output.png)"
]
},
{
"cell_type": "markdown",
- "id": "e6035f7e",
+ "id": "ef7ea0d9",
"metadata": {},
"source": [
"[Link to sweetviz](https://github.com/fbdesignpro/sweetviz)"
]
},
{
"cell_type": "markdown",
- "id": "c9f7d411",
+ "id": "a7417c5d",
"metadata": {},
"source": [
"### quadratic: Data Science Spreadsheet with Python and SQL\n",
@@ -167,7 +167,7 @@
},
{
"cell_type": "markdown",
- "id": "af81e694",
+ "id": "2bf92ed6",
"metadata": {},
"source": [
"### whylogs: Data Logging Made Easy"
@@ -176,7 +176,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "44f003ec",
+ "id": "baa5966b",
"metadata": {
"tags": [
"hide-cell"
@@ -189,7 +189,7 @@
},
{
"cell_type": "markdown",
- "id": "515fed97",
+ "id": "d4c3afb5",
"metadata": {},
"source": [
"Keeping track of dataset statistics is crucial for data quality and monitoring. whylogs makes logging dataset summaries straightforward.\n",
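Note on the cell above: a dataset summary of the kind the text describes can be sketched in plain Python. This is not the whylogs API; `profile_column` is a hypothetical helper, and the sample rows are made up for illustration. The point is only that a compact per-column summary (counts, missing values, range, mean) can be logged and compared over time without keeping the raw data.

```python
import statistics


def profile_column(values):
    """Summarize one column: size, missing values, and basic statistics (hypothetical helper)."""
    present = [v for v in values if v is not None]
    summary = {"count": len(values), "missing": len(values) - len(present)}
    if present:
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["mean"] = statistics.fmean(present)
    return summary


# Made-up sample rows; None marks a missing value.
rows = [
    {"age": 31, "income": 52000.0},
    {"age": None, "income": 61000.0},
    {"age": 45, "income": 58000.0},
]

# One summary per column, keyed by column name.
profile = {column: profile_column([row[column] for row in rows]) for column in rows[0]}

for column, summary in profile.items():
    print(column, summary)
```

Storing one such summary per batch makes it possible to spot drift, for example a rising `missing` count, by comparing summaries rather than datasets.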
@@ -200,7 +200,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "d6da1d5f",
+ "id": "01e06b51",
"metadata": {},
"outputs": [],
"source": [
@@ -227,7 +227,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "05a506dd",
+ "id": "301dc0bd",
"metadata": {},
"outputs": [],
"source": [
@@ -237,7 +237,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "741bf81a",
+ "id": "e210ef80",
"metadata": {},
"outputs": [],
"source": [
@@ -246,23 +246,23 @@
},
{
"cell_type": "markdown",
- "id": "b2a362a1",
+ "id": "e66eacfd",
"metadata": {},
"source": [
"[Link to whylogs](https://github.com/whylabs/whylogs)."
]
},
{
"cell_type": "markdown",
- "id": "08a8d247",
+ "id": "9e79ab66",
"metadata": {},
"source": [
"### Fluke: The Easiest Way to Move Data Around"
]
},
{
"cell_type": "markdown",
- "id": "e3e3b948",
+ "id": "cf77769e",
"metadata": {},
"source": [
"Transferring data between locations—such as from a remote server to cloud storage—can be cumbersome, especially with Python libraries that involve complex HTTP/SSH connections and directory handling. \n",
@@ -274,7 +274,7 @@
},
{
"cell_type": "markdown",
- "id": "5d914362",
+ "id": "b8321eb0",
"metadata": {},
"source": [
"```python\n",
@@ -297,7 +297,7 @@
},
{
"cell_type": "markdown",
- "id": "553b2fe5",
+ "id": "7bdf6158",
"metadata": {},
"source": [
"```python\n",
@@ -313,15 +313,15 @@
},
{
"cell_type": "markdown",
- "id": "dd7831e3",
+ "id": "59acab4e",
"metadata": {},
"source": [
"[Link to Fluke](https://github.com/manoss96/fluke)."
]
},
{
"cell_type": "markdown",
- "id": "000f015b",
+ "id": "d3636ffa",
"metadata": {},
"source": [
"### safetensors: A Simple and Safe Way to Store and Distribute Tensors"
@@ -330,7 +330,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "57f18677",
+ "id": "35a7b733",
"metadata": {
"tags": [
"hide-cell"
@@ -343,7 +343,7 @@
},
{
"cell_type": "markdown",
- "id": "0d21ea72",
+ "id": "2a139f90",
"metadata": {},
"source": [
"PyTorch defaults to using Pickle for tensor storage, which poses security risks as malicious pickle files can execute arbitrary code upon unpickling. In contrast, safetensors specialize in securely storing tensors, guaranteeing data integrity during storage and retrieval. \n",
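Note on the cell above: the pickle risk it describes can be shown with the standard library alone. The sketch below uses a harmless arithmetic expression as a stand-in for attacker-chosen code; `Payload` is a hypothetical class written for this illustration and has nothing to do with PyTorch or safetensors themselves.

```python
import pickle


class Payload:
    """An object that tells pickle to 'rebuild' it by calling eval."""

    def __reduce__(self):
        # pickle stores (callable, args) and calls callable(*args) on load.
        # A harmless expression stands in for arbitrary code here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading the bytes runs eval("6 * 7"); no Payload instance comes back.
result = pickle.loads(blob)
print(result)  # 42
```

Because `pickle.loads` calls whatever callable the byte stream names, loading an untrusted pickle file amounts to running untrusted code. A format that stores only tensor data and metadata avoids that step.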
@@ -354,7 +354,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "9011bd3b",
+ "id": "fe38da49",
"metadata": {
"editable": true,
"slideshow": {
@@ -381,15 +381,15 @@
},
{
"cell_type": "markdown",
- "id": "36685b4e",
+ "id": "79304249",
"metadata": {},
"source": [
"[Link to safetensors](https://bit.ly/3vqzbhl)."
]
},
{
"cell_type": "markdown",
- "id": "a85ca417",
+ "id": "ef96faf4",
"metadata": {},
"source": [
"### datacompy: Smart Data Comparison Made Simple"
@@ -398,7 +398,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "eed29606",
+ "id": "0db96d3d",
"metadata": {
"editable": true,
"slideshow": {
@@ -415,7 +415,7 @@
},
{
"cell_type": "markdown",
- "id": "8c084d13",
+ "id": "c578fa05",
"metadata": {},
"source": [
"Data analysts and data engineers often struggle with comparing two datasets. This results in writing complex code to compare values, identify mismatches, and generate comparison reports."
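Note on the cell above: the hand-written comparison code it refers to tends to look something like the sketch below. This is plain Python, not datacompy; `compare_records` and the sample rows are hypothetical and exist only to show how much bookkeeping even a minimal key-based comparison needs.

```python
def compare_records(left, right, key):
    """Compare two lists of dict rows joined on `key` (hypothetical helper)."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}

    # Value-level mismatches for rows present on both sides.
    mismatches = []
    for k in sorted(left_by_key.keys() & right_by_key.keys()):
        shared_columns = left_by_key[k].keys() & right_by_key[k].keys()
        for column in sorted(shared_columns):
            a, b = left_by_key[k][column], right_by_key[k][column]
            if a != b:
                mismatches.append((k, column, a, b))

    return {
        "only_in_left": sorted(left_by_key.keys() - right_by_key.keys()),
        "only_in_right": sorted(right_by_key.keys() - left_by_key.keys()),
        "mismatches": mismatches,
    }


# Made-up sample data.
old = [{"id": 1, "price": 10.0}, {"id": 2, "price": 20.0}, {"id": 3, "price": 30.0}]
new = [{"id": 1, "price": 10.0}, {"id": 2, "price": 25.0}, {"id": 4, "price": 40.0}]

report = compare_records(old, new, key="id")
print(report)
```

Even this version ignores duplicate keys, float tolerances, and columns that exist on only one side, which is the gap a dedicated comparison library is meant to fill.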
@@ -424,7 +424,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "fc084d3e",
+ "id": "17601c72",
"metadata": {},
"outputs": [],
"source": [
@@ -447,7 +447,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "9807d701",
+ "id": "dcb8aa9e",
"metadata": {},
"outputs": [],
"source": [
@@ -466,7 +466,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "9efe2035",
+ "id": "68bbbe78",
"metadata": {},
"outputs": [],
"source": [
@@ -488,7 +488,7 @@
},
{
"cell_type": "markdown",
- "id": "23942bf8",
+ "id": "66517fc1",
"metadata": {},
"source": [
"With datacompy, you can easily compare datasets and get detailed reports about differences, including matching percentage, column-level comparison, and sample mismatches. You can use it with various data frameworks like Pandas, Spark, Polars, and Snowflake."
@@ -497,7 +497,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "6c20e21a",
+ "id": "0aa3e7cb",
"metadata": {},
"outputs": [],
"source": [
@@ -509,7 +509,7 @@
{
"cell_type": "code",
"execution_count": null,
- "id": "c67afacd",
+ "id": "aba82c40",
"metadata": {},
"outputs": [],
"source": [
@@ -518,7 +518,7 @@
},
{
"cell_type": "markdown",
- "id": "5dfa9a3f",
+ "id": "2a5eec50",
"metadata": {},
"source": [
"[Link to datacompy](https://github.com/capitalone/datacompy)."
