You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
refactor: replace speed claims with intelligent engine selection messaging
- Update README to focus on comprehensive PII coverage vs raw performance
- Transform benchmark report from speed analysis to engine capability analysis
- Add industry-specific use cases (financial vs legal vs enterprise)
- Emphasize complementary engine strengths over competitive metrics
- Include auto mode fallback testing for complete performance picture
- Remove all "190x faster" claims pending industry-specific messaging strategy
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
DataFog is the fastest open-source library for detecting and anonymizing personally identifiable information (PII) in unstructured data. Built for production workloads, it delivers enterprise-grade performance without the complexity.
24
+
DataFog is a comprehensive open-source library for detecting and anonymizing personally identifiable information (PII) in unstructured data. Built for production workloads, it delivers intelligent engine selection to handle both structured identifiers and contextual entities across different industries and use cases.
25
25
26
26
## ⚡ Why Choose DataFog?
27
27
28
-
**🚀 Blazing Fast Performance**
29
-
-**190x faster** than spaCy for structured PII detection
30
-
- Sub-3ms processing times for most documents
31
-
- Optimized pattern engine with intelligent spaCy fallback
28
+
**🧠 Intelligent Engine Selection**
29
+
30
+
- Automatically chooses the best detection approach for your data
While the fast pattern engine is significantly faster (190x faster in our benchmarks), there are specific scenarios where you might want to use spaCy:
405
+
- Primary need: Both structured and contextual PII detection
406
+
- Recommended: `auto` engine for intelligent selection
407
+
- Benefits from both engines depending on content type
408
+
409
+
### When to Use Each Engine
399
410
400
-
1.**Complex entity recognition**: When you need to identify entities not covered by standard patterns, such as organization names, locations, or product names that don't follow predictable formats.
411
+
**Regex Engine**: Choose when you need to detect specific, well-formatted identifiers:
401
412
402
-
2.**Context-aware detection**: When the meaning of text depends on surrounding context that patterns cannot easily capture, such as distinguishing between a person's name and a company with the same name based on context.
413
+
- Processing structured databases or forms
414
+
- Compliance scanning for specific regulatory requirements (GDPR, HIPAA)
415
+
- High-volume processing where deterministic results are important
416
+
- Financial data with credit cards, SSNs, account numbers
403
417
404
-
3.**Multi-language support**: When processing text in languages other than English where standard patterns might need significant customization.
418
+
**SpaCy Engine**: Choose when you need contextual understanding:
405
419
406
-
4.**Research and exploration**: When experimenting with NLP capabilities and need the full power of a dedicated NLP library with features like part-of-speech tagging, dependency parsing, etc.
420
+
- Analyzing unstructured documents, emails, or communications
421
+
- Legal eDiscovery where names and organizations are key
422
+
- Content where entities don't follow standard patterns
423
+
- Multi-language support requirements
407
424
408
-
5.**Unknown entity types**: When you don't know in advance what types of entities might be present in your text and need a more general-purpose entity recognition approach.
425
+
**Auto Engine**: Choose for general-purpose PII detection:
409
426
410
-
For high-performance production systems processing large volumes of text with known entity types (emails, phone numbers, credit cards, etc.), the fast pattern engine is strongly recommended due to its significant speed advantage.
427
+
- Unknown or mixed content types
428
+
- Applications serving multiple industries
429
+
- When you want comprehensive coverage without manual engine selection
430
+
- Default choice for most production applications
411
431
412
-
### Running Benchmarks Locally
432
+
### Running Detection Tests
413
433
414
-
You can run the performance benchmarks locally using pytest-benchmark:
434
+
You can test the different engines locally using pytest:
0 commit comments