Data is often called “the new oil,” but from a security perspective, data can be toxic waste — the more you collect and store, the more you have to lose. Data Attack Surface Reduction (ASR) is about collecting less, storing less, and exposing less data by design.
It challenges the “collect everything, analyze later” mindset and instead asks:
“Do we actually need this data, now or ever?”
By minimizing data collection, limiting its retention, and avoiding unnecessary sharing or transformation, we reduce the blast radius of breaches — or eliminate some altogether.
1. Collect Less Data by Default
The most secure data is data you never collected in the first place. Every field in your database, every API parameter, every log entry is a potential liability. Data minimalism should be your default posture.
The Principle
Adopt a policy of radical data minimalism: collect only what’s absolutely necessary for the product or service to function. Don’t gather speculative or “future-use” data — that’s security debt with no payoff.
Implementation
Challenge every data field during design reviews:
- Do we really need birth dates, or just age verification (yes/no)?
- Do we need full addresses, or just postal codes for shipping zones?
- Do we need exact locations, or just city-level data for features?
Bad example: An e-commerce site collects full date of birth, mother’s maiden name, and security questions “in case we need them for support.”
Good example: The same site collects only “age over 18” flag and uses email-based verification for account recovery. No PII that could be weaponized in social engineering attacks.
Code Example
```python
# BAD: Collecting more than needed
from datetime import date

class UserProfile:
    email: str
    full_name: str
    date_of_birth: date
    phone_number: str
    address_line_1: str
    address_line_2: str
    city: str
    postal_code: str
    ssn_last_4: str  # Why???
```

```python
# GOOD: Minimal collection
class UserProfile:
    email: str
    display_name: str          # Not necessarily real name
    shipping_postal_code: str  # Only if they order physical goods
```
Real-World Impact
The 2017 Equifax breach exposed sensitive data of 147 million people, including Social Security numbers, birth dates, and addresses. Much of this data was collected for credit checks but retained indefinitely, creating a massive target.
Why this matters: Less data collected = less data to protect = less damage if compromised = lower regulatory penalties = reduced storage costs.
Further reading:
- GDPR Article 5: Data Minimization
- NIST Privacy Framework: Data Minimization
- Privacy by Design Principles
2. Reduce Data Retention
Every day you keep data is another day it can be stolen, leaked, or misused. Delete data as soon as its utility ends.
The Problem
Organizations hoard data indefinitely “just in case” — often without clear justification. This creates:
- Compliance risk (GDPR’s storage limitation principle)
- Increased backup size and cost
- Larger breach blast radius
- Technical debt in data pipelines
Implementation Strategy
Apply aggressive time-based purging for:
Application logs: Keep 30-90 days maximum unless required otherwise
```bash
# Example: Automated log cleanup
# Run from cron or a systemd timer
find /var/log/app/ -type f -mtime +30 -delete
```
Database backups: Retain only what your disaster recovery plan requires (typically 30 days, not 5 years)
Inactive user data: Delete accounts inactive for 2+ years
```sql
-- Example deletion policy
DELETE FROM users
WHERE last_login < NOW() - INTERVAL '2 years'
  AND account_status = 'inactive';
```
Temporary data: Session data, caches, and uploaded files should have TTLs (time-to-live expirations)
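The TTL idea can be sketched with a tiny in-memory store (illustrative only; in production this role is usually played by the built-in expiry of your cache or session backend, e.g. Redis key expiration):

```python
import time

class TTLStore:
    """Minimal in-memory store whose entries expire after a TTL, in seconds."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry per monotonic clock)

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Lazily purge expired entries on access
            del self._data[key]
            return None
        return value

# Session data written with an expiry from the start
store = TTLStore()
store.set("session:abc123", {"user_id": 42}, ttl_seconds=1800)
```

The point is that every temporary record carries an expiry from the moment it is written, rather than relying on someone remembering to clean up later.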
Real-World Example
Leaked S3 buckets full of old database backups are a common breach source. In 2019, the Capital One breach exposed 100 million+ records, including data from backups spanning multiple years. These old backups served no business purpose — they were just ticking time bombs.
In 2020, the Wattpad breach exposed account data, including old records that should have been purged years earlier.
Compliance Bonus
GDPR requires “storage limitation” — data should be kept only as long as necessary. Automatic deletion isn’t just security hygiene; it’s legal compliance.
Further reading:
- GDPR Article 17: Right to Erasure
- NIST 800-53: Media Sanitization
- Data Retention Best Practices - SANS
3. Limit Data Propagation and Transformation
The more you move or copy data, the more risk you create. Every pipeline, every integration, every environment sync multiplies your attack surface.
Common Anti-Patterns
Copying production data to dev/staging: This is shockingly common and incredibly dangerous
- Production database dumps loaded into developer laptops
- Staging environments with real customer PII
- Test environments without proper access controls
Uncontrolled ETL pipelines: Data flowing through multiple systems
- Data lakes ingesting everything “just in case”
- Real-time streams with overly broad scopes
- Analytics pipelines that copy full datasets
Third-party integration overload: Syncing PII into multiple SaaS tools
- Marketing platforms receiving full customer profiles
- Support tools with access to payment info
- Analytics trackers collecting more than necessary
Best Practices
Do analytics in-place where possible:
```sql
-- GOOD: Query a prod read replica with aggregation
SELECT DATE(created_at), COUNT(*), AVG(order_value)
FROM orders
WHERE created_at > NOW() - INTERVAL '30 days'
GROUP BY DATE(created_at);

-- BAD: Exporting the full orders table to a data warehouse daily (pseudocode)
EXPORT orders TO 's3://analytics-bucket/full-dump/';
```
Mask or anonymize dev/test data:
```python
import hashlib

# Example: Data masking for development
def mask_for_dev(user_data):
    # Retain structure, remove actual PII
    return {
        'email': hashlib.sha256(user_data['email'].encode()).hexdigest()[:8] + '@example.com',
        'name': 'Test User ' + str(user_data['id']),
        'phone': 'XXX-XXX-' + user_data['phone'][-4:],
    }
```
Restrict third-party data sharing: Remove or audit integrations
- Review what your analytics tags are collecting
- Check what your CRM integration actually needs
- Audit OAuth scopes granted to third parties
Real-World Incident
The 2022 Meta Pixel hospital case revealed that healthcare providers inadvertently sent appointment details, medical conditions, and physician information to Facebook due to excessive analytics integration. HIPAA-protected data was being exfiltrated through tracking pixels that “nobody thought about.”
In 2021, the Codecov supply chain attack saw attackers modify the company’s Bash Uploader script to exfiltrate environment variables from CI/CD systems, potentially capturing secrets, tokens, and credentials from thousands of customers’ build pipelines.
Why this matters: Each data copy is an opportunity for exposure. Each transformation is a potential leak point. Each integration is a trust decision.
4. Don’t Over-Engineer Analytics
Massive data lakes, complex Kafka pipelines, and real-time dashboards often include overly broad access controls, insecure intermediate stores, and excessive logging of sensitive fields. The complexity itself becomes the vulnerability.
The Problem
Modern analytics infrastructure is often:
- Over-scoped: Collecting far more data than actually analyzed
- Over-permissioned: Analytics systems with read access to everything
- Under-governed: No clear ownership of what’s being collected
- Over-retained: Data kept “forever” because storage is cheap
ASR Principle
Just because you can analyze it doesn’t mean you should store it.
Focus on:
- Aggregated over raw data: Store metrics, not individual events
- Sampling over exhaustive capture: 1% sample often tells the same story as 100%
- Simpler pipelines: Fewer moving parts = fewer opportunities for leaks
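Deterministic sampling is one simple way to act on the sampling point above. A minimal Python sketch (the 1% rate and the hashing scheme are illustrative assumptions): hashing a stable identifier keeps whole sessions consistently in or out of the sample, instead of flipping a coin per event.

```python
import hashlib

def in_sample(session_id: str, sample_rate: float = 0.01) -> bool:
    """Deterministically decide whether a session's events are kept.

    Hashing a stable id (rather than calling random.random() per event)
    keeps every event of a sampled session, so samples stay coherent.
    """
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    # Interpret the first 8 hex chars as a fraction in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < sample_rate

# Roughly 1% of sessions are kept, and the decision never flips
kept = sum(in_sample(f"session-{i}") for i in range(100_000))
```

Everything outside the sample is dropped at the edge and never stored, which is the whole point: the discarded 99% cannot leak.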
Implementation Examples
Instead of this:
```yaml
# Collecting everything
analytics_events:
  - user_id
  - session_id
  - ip_address
  - user_agent
  - page_url
  - referrer
  - timestamp
  - custom_properties  # Unbounded object
```
Do this:
```yaml
# Collect aggregates
analytics_metrics:
  - page_path       # Not full URL with query params
  - country_code    # Not IP address
  - browser_family  # Not full user agent
  - timestamp_hour  # Not exact timestamp
  - event_count
```
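One way to perform this coarsening at ingestion time, sketched in Python. The field names mirror the schemas above; the country and browser lookups are stubbed as assumptions, since in practice they come from a geo/user-agent service rather than the event itself:

```python
from datetime import datetime, timezone
from urllib.parse import urlparse

def coarsen_event(event: dict) -> dict:
    """Reduce a raw analytics event to coarse, low-risk dimensions."""
    ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return {
        # Keep the path only, dropping host and query parameters
        "page_path": urlparse(event["page_url"]).path,
        # Stubbed lookups: real values would come from geo/UA services
        "country_code": event.get("country_code", "unknown"),
        "browser_family": event.get("browser_family", "unknown"),
        # Truncate to the hour so no exact timestamp is stored
        "timestamp_hour": ts.replace(minute=0, second=0, microsecond=0).isoformat(),
    }

metric = coarsen_event({
    "page_url": "https://shop.example/checkout?cart=abc123",
    "timestamp": 1700000000,
    "country_code": "DE",
    "browser_family": "Firefox",
})
```

If only the coarse record is ever written to storage, the raw event (IP, full URL, exact timestamp) exists only transiently in memory at the edge.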
Real architecture shift:
Old approach: Stream all events → Kafka → Data lake → Multiple analytics tools
New approach: Process events at edge → Aggregate → Store only summaries
Case Study
In 2019, researchers found that Facebook stored hundreds of millions of passwords in plaintext in internal logging systems. This happened because logging was overly verbose and captured authentication payloads “for debugging.” This is over-engineered observability becoming a security incident.
Further reading:
- Building Secure and Reliable Systems - Google - Chapter on data handling
- Privacy-Preserving Analytics - Apple’s approach
- Differential Privacy - Microsoft Research
5. Real-World Breach Patterns
Understanding how data breaches actually happen helps prioritize what to fix.
Pattern 1: Public Cloud Storage Exposure
The vulnerability: Misconfigured S3 buckets, Azure Blob containers, or Google Cloud Storage with public read access.
Real incidents:
- 2019: Capital One (mentioned earlier)

ASR approach:
```bash
# AWS: Block public access by default (account-wide)
aws s3control put-public-access-block \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,\
BlockPublicPolicy=true,RestrictPublicBuckets=true \
  --account-id 123456789012

# Regular audit: check the ACL of every bucket
aws s3api list-buckets --query 'Buckets[*].[Name]' --output text \
  | xargs -I {} aws s3api get-bucket-acl --bucket {}
```
Pattern 2: Log Leaks
The vulnerability: API tokens, passwords, or PII included in logs and shipped to centralized logging solutions.
Common mistakes:
- Authorization headers logged in web server logs
- Database query logs containing sensitive WHERE clauses
- Error messages revealing internal system details
- Debug logs pushed to production
Example bad logging:
```python
# BAD
logger.info(f"User login: {email} with password {password}")
logger.debug(f"API call: {full_request_including_auth_header}")

# GOOD
logger.info(f"User login attempt: {email[:3]}***")
logger.debug(f"API call to {endpoint} - status {status_code}")
```
Pattern 3: Stale Environments
The vulnerability: Old production data retained in test environments without sufficient access control.
Scenario: Company creates “staging-2020”, “staging-2021”, “staging-2022” databases. The old ones still contain production data but nobody remembers to secure them or delete them.
ASR approach: Automated cleanup policies
```bash
# Example: Tag and auto-delete old environments
aws rds describe-db-instances \
  --query 'DBInstances[?TagList[?Key==`Environment` && Value==`staging`]]' \
  | jq -r '.[] | select(.InstanceCreateTime < "2023-01-01") | .DBInstanceIdentifier' \
  | xargs -I {} aws rds delete-db-instance --db-instance-identifier {} --skip-final-snapshot
```
Pattern 4: Third-Party Exfiltration
The vulnerability: Overly permissive tracking scripts, SDKs, or tags sending data externally without proper review.
Real incidents:
- 2022: Meta Pixel healthcare leaks (mentioned earlier)
ASR approach: Content Security Policy (CSP)
```html
<meta http-equiv="Content-Security-Policy"
      content="default-src 'self';
               script-src 'self' https://trusted-cdn.com;
               connect-src 'self' https://api.yoursite.com;">
```
6. Guidelines for Practicing Data ASR
| Principle | Action | How to Implement |
|---|---|---|
| Minimal Collection | Ask “why do we need this?” before collecting any new data | Add data collection reviews to design process; require written justification |
| Retention Limits | Automate expiry and deletion policies | Implement TTLs in databases; schedule cleanup jobs; use lifecycle policies |
| Access Control | Least privilege on data systems | Audit who can read/write data; remove unused permissions; use IAM policies |
| Anonymization | Use tokenization or hashing where PII isn’t needed | Implement pseudonymization for analytics; hash email addresses in logs |
| Tag & Tracker Review | Audit and limit third-party scripts collecting user info | Regular review of all external JavaScript; implement CSP headers |
| Incident Simulation | Run tabletop scenarios assuming your logs, exports, or staging DBs are leaked | Quarterly exercises asking “what if X database was public?” |
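For the anonymization row above, note that a plain hash of an email can be reversed by hashing a dictionary of known addresses and comparing. A keyed hash (HMAC) avoids that. A minimal sketch, assuming the key is loaded from a secrets manager rather than hardcoded:

```python
import hashlib
import hmac

# Placeholder: load from a secrets manager, never hardcode or commit
PSEUDONYM_KEY = b"load-me-from-a-secrets-manager"

def pseudonymize_email(email: str) -> str:
    """Stable, keyed token for an email address.

    Unlike a bare SHA-256, an attacker without the key cannot confirm
    guesses by hashing a dictionary of known addresses.
    """
    digest = hmac.new(PSEUDONYM_KEY, email.strip().lower().encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

token = pseudonymize_email("Alice@Example.com")
```

Because the token is stable, it can still be used to join or count records in analytics; because it is keyed, it cannot be reversed without the key.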
7. Shift the Culture: Less Data Is a Win
Making data minimization part of your culture requires changing how you think about data:
Benefits of Collecting Less
Regulatory compliance becomes easier: GDPR, CCPA, DPDP, HIPAA all favor minimal data collection
- Fewer data subject access requests (DSARs) to handle
- Smaller scope for breach notification requirements
- Lower fines when incidents do occur
Breach costs go down: The IBM Cost of a Data Breach Report 2023 shows average cost per record is $165. Fewer records = lower potential costs.
User trust goes up: Privacy-conscious users prefer services that collect less. Apple has built a brand around this principle.
Engineers get clarity: When you only store what matters, data models become clearer and queries become simpler.
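As a back-of-the-envelope illustration of the breach-cost point, using the $165-per-record average (a deliberately naive linear estimate, not how real breach cost models work):

```python
COST_PER_RECORD_USD = 165  # IBM Cost of a Data Breach Report 2023 average

def naive_breach_exposure(records_stored: int) -> int:
    """Linear exposure estimate: records at risk times average cost per record."""
    return records_stored * COST_PER_RECORD_USD

# Purging half of a 10M-record store halves the naive exposure
before = naive_breach_exposure(10_000_000)
after = naive_breach_exposure(5_000_000)
```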
How to Shift Mindset
- Data deletion as a metric: Track and celebrate data removed
- Privacy by default: Make “collect less” the default in design docs
- Cost visibility: Show teams the storage and compliance costs of data hoarding
- Regular audits: Quarterly reviews of what data exists and why
“Data is radioactive. Store only what you can shield. And only for as long as you need it.”
Further reading:
- GDPR Compliance Checklist
- Privacy Engineering at Scale - Facebook/Meta
- Data Minimization Strategies - Future of Privacy Forum
Final Thought
Every byte of data you don’t collect is a byte that can’t be stolen. Every record you delete is a record that can’t leak. Every integration you remove is one less potential exfiltration path.
Data ASR isn’t about crippling your product — it’s about being honest about what you actually need versus what you’re hoarding “just in case.”
Start today:
- Identify your oldest dataset
- Ask if anyone has used it in the past 6 months
- If not, delete it
- Repeat weekly
The best defense against data breaches is having less data to breach.