Ask HN: Google spam filter getting worse?

2023-01-20 10:50:27

Google probably lets a small amount of known spam through for data gathering. See this quote from Google’s “Rules of Machine Learning” [1] (a great resource, by the way):

> Rule #34: In binary classification for filtering (such as spam detection or determining interesting emails), make small short-term sacrifices in performance for very clean data.

> In a filtering task, examples which are marked as negative are not shown to the user. Suppose you have a filter that blocks 75% of the negative examples at serving. You might be tempted to draw additional training data from the instances shown to users. For example, if a user marks an email as spam that your filter let through, you might want to learn from that.

> But this approach introduces sampling bias. You can gather cleaner data if instead during serving you label 1% of all traffic as “held out”, and send all held out examples to the user. Now your filter is blocking at least 74% of the negative examples. These held out examples can become your training data.

> Note that if your filter is blocking 95% of the negative examples or more, this approach becomes less viable. Even so, if you wish to measure serving performance, you can make an even tinier sample (say 0.1% or 0.001%). Ten thousand examples is enough to estimate performance quite accurately.

[1] https://developers.google.com/machine-learning/guides/rules-…
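To make the arithmetic in the quoted rule concrete, here is a minimal sketch of hold-out routing. The names `route`, `effective_block_rate`, and `HOLDOUT_RATE` are made up for illustration, and nothing here is claimed to be how Gmail actually does it. It only shows why a 75% filter with a 1% hold-out still blocks at least 74% of negatives.

```python
import random

HOLDOUT_RATE = 0.01  # 1% of traffic, the figure used in the quoted rule


def route(filter_says_spam: bool, rng: random.Random) -> tuple[str, bool]:
    """Return (destination, held_out) for one message.

    Hypothetical routing: held-out messages always go to the inbox,
    whatever the filter decided, so user labels on them are not biased
    by the filter's own decisions.
    """
    held_out = rng.random() < HOLDOUT_RATE
    if held_out:
        return "inbox", True
    return ("spam" if filter_says_spam else "inbox"), False


def effective_block_rate(base_block_rate: float,
                         holdout_rate: float = HOLDOUT_RATE) -> float:
    """Fraction of negatives still blocked once the hold-out is exempted."""
    return base_block_rate * (1.0 - holdout_rate)


if __name__ == "__main__":
    # 0.75 * 0.99 is roughly 0.7425, i.e. "at least 74%" as the rule says.
    print(round(effective_block_rate(0.75), 4))
```

If something like this is running, a small trickle of obvious spam in the inbox would be expected by design, although 1% is far too small to explain 10-20 messages a day.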

For months now, emails with subjects like “MCAfeeconfirmati0n–#21845315” and “confirmation#4073301981” have been hitting my inbox. These are such obvious spam emails that I’m unsure how the spam filters aren’t catching them. Reporting them as spam hasn’t done anything to catch them.

I have this same problem with Outlook. Starting probably 2-3 months ago I began receiving somewhere from 5-10 spam emails with titles like this a day directly into my inbox. Reporting them as spam helped a little and brought it down to maybe 1-5. But they’re obviously spam with subjects like Norton Confirmation, OuOrtIBGGvGIO, Life Insurance Offer, etc. with weird fonts and other stuff.

As a side note, a lot of these spam emails I get are from Gmail.

They’re multipart messages, which seems to trip up Gmail: apparently one part is scanned and a different one is displayed. Base64-decode the source parts and add a keyword filter for the “non-spam” text, as it’s usually pretty static.
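A rough sketch of that decode-then-filter idea, using only Python's standard `email` package. The function names and the keyword list are illustrative assumptions; you would fill in whatever static decoy text you actually see in the raw source of your own spam.

```python
from email import message_from_string


def decoded_text_parts(raw: str) -> list[str]:
    """Decode every text/* part of a raw message (base64 or quoted-printable)."""
    msg = message_from_string(raw)
    texts = []
    for part in msg.walk():
        # Multipart containers have maintype "multipart" and are skipped here.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undoes the transfer encoding
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            texts.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back rather than crash.
            texts.append(payload.decode("utf-8", errors="replace"))
    return texts


def has_static_decoy(raw: str, keywords: list[str]) -> bool:
    """True if any decoded part contains one of the (assumed) decoy phrases."""
    parts = [text.lower() for text in decoded_text_parts(raw)]
    return any(kw.lower() in text for kw in keywords for text in parts)
```

The point of checking every part is that the part you see rendered may not be the part carrying the innocuous-looking text.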

Yeah, it’s been happening to me for about a year now. I went as far as to make another email just to avoid it. Made me sad. I had that email address since 2008 or so.

Judging from my own spam label on gmail, those messages are part of the torrent of junk that is pouring out of Microsoft’s “hybrid on-premises exchange” egress VIPs. Basically some clown who pays Microsoft for quasi-hosted Exchange has a virus that sends spam, and Microsoft blesses it with the reputation of the customer egress addresses. Eventually, this will stop working for Microsoft but at this time it’s like waiting for Greenland to melt: inevitable, but takes a long time.

Also worth noting if you are trying to evaluate gmail’s classification performance that the vast majority of what they think was spam is not in your spam label, it got stopped with a 4xx error code at SMTP time. So you don’t really have a way to know the denominator.

And good luck getting off that list if you’re on a hosted VPS… they’re about impossible… I can get through to hotmail and o365, but not the outlook.com block. (shrug)

I’m relaying through SendGrid, since I don’t send enough email from/through my server to justify even the lowest paid level (there is a free tier), and this way I don’t have to worry about it…

I’ve been considering setting up a higher end server (compared to the $20/mo VPS I’d been using) at a data center and seeing what I can manage as a direct mail host without the relay. But 10x-ing my costs just doesn’t feel right for something that will take more time, won’t generate revenue, and that I’m not that passionate about.

For those curious, been looking at WildDuck mail which seems like an interesting structure and the features are cool, just not sure I want to go through it all. I’ve been using Mailu via docker-compose on DigitalOcean for a couple years for all my lesser used domains/addresses, relaying through SendGrid. It works but kind of annoying going through setting up each domain added through the relay.

I run my own mail server + spam filter, so I’ll chime in. I have seen a high uptick in spam making it to my inbox in the last two weeks. I primarily rely on Spamhaus blocklists + a Bayesian filter trained on old spam.

The uptick I have seen is going from 0-2 spams making it to my inbox to 10-20 spams making it to my inbox. When this has happened in the past, I have assumed it is spammers bypassing blocklists by finding new hosts, or by spammers finding a clever way to beat the filter. Usually after these big upticks, they drop off again suddenly, which makes me believe that it was a blocklist bypass and not a filter bypass (my filter is pretty weak and hasn’t been retrained/updated in many years.)
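For anyone who hasn't looked at how a DNS blocklist check works, here is a small sketch. It is not the setup described above, just the general DNSBL convention: reverse the IPv4 octets, append the list's zone, and do an A-record lookup. The zone `zen.spamhaus.org` is Spamhaus's combined list; the function names are made up, and the handling of `127.255.255.x` answers as error codes rather than listings is my understanding of Spamhaus's convention, so check their documentation before relying on it.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str = "zen.spamhaus.org") -> str:
    """Build the DNS name to query for an IPv4 address.

    Raises ValueError for anything that is not a valid IPv4 address.
    """
    addr = ipaddress.IPv4Address(ip)
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str = "zen.spamhaus.org") -> bool:
    """Look the address up; needs network access and a suitable resolver."""
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # NXDOMAIN or any lookup failure: treated as "not listed" (fails open).
        return False
    # Assumption: 127.255.255.x answers signal query errors, not listings.
    return answer.startswith("127.") and not answer.startswith("127.255.255.")


if __name__ == "__main__":
    print(dnsbl_query_name("192.0.2.1"))
```

Because a check like this only knows about hosts already on the list, spam from freshly compromised or high-reputation hosts sails past it, which fits the "blocklist bypass" theory.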

Given all the news about hacks with self-hosted Exchange, more likely they’re relaying through hosts with a built up trust… As good as Exchange + Outlook are as a user, it is pretty painful to see exploits in the wild like this.

The whole system just sucks, and feels too entrenched to come up with something better. Even a notify+pull system wouldn’t fix these kinds of exploits, even if it would correct end-user breaches.

I think it’s a combo of two things:

1) To get the best training data, you sometimes need to let things you’ve classified as spam into the inbox to verify that the user marks it as spam. It’s pretty standard for training a classification system to occasionally pass negative samples to verify their negativity.

2) The spam filter itself almost certainly has a latency budget, and if it can’t respond in time, the message is passed unfiltered. In other words I think the spam filter fails open. It’s probably just been down more lately.
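To illustrate what "fails open" means in point 2, here is a minimal sketch, assuming a classifier is just a callable that returns True for spam. The helper name `classify_fail_open` and the thread pool are illustrative choices, not a description of any real mail system.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool so a slow classifier call does not block the caller.
_POOL = ThreadPoolExecutor(max_workers=4)


def classify_fail_open(classifier, message: str, budget_seconds: float) -> bool:
    """Return True only if the classifier says "spam" within the budget.

    On timeout or on any classifier error the message is treated as not
    spam, i.e. it is delivered unfiltered (fail open).
    """
    future = _POOL.submit(classifier, message)
    try:
        return bool(future.result(timeout=budget_seconds))
    except FutureTimeout:
        return False  # too slow: let the message through
    except Exception:
        return False  # classifier crashed: let the message through
```

The alternative, failing closed, would delay or drop legitimate mail whenever the filter is unhealthy, which is usually considered the worse outcome for email.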

I have been getting tons of PDFs whose previews show pictures of women. The subject and body of the emails just seem to be random words, like in a seed phrase, along with some random single-digit numbers. The email is sent from office, hotmail or gmail accounts and verifies. The To field is also filled with other email addresses. I have been getting this for 3 or 4 months, and reporting it as spam does not work. In all the years I have had a gmail account, it has never really been a problem.

Microsoft has the problem as well, it’s not just Google. Do they not filter outgoing?

  Message ID <9UOejz_TlFksgoyXm9GI5Q@notifications.google.com>
  Created at: Fri, Jan 20, 2023 at 9:14 AM (Delivered after 0 seconds)
  From: "Girl Shows Girl cast a lookSTART JOIN Muriel (Classroom)" <no-reply@classroom.google.com>
  To: XXXXXXXXX
  Subject: Class invitation: "Check Join now View gambling Babe amidcustity"
  SPF: PASS with IP 209.85.220.69
  DKIM: 'PASS' with domain google.com
  DMARC: 'PASS'


  Message ID <DM6PR18MB3569050DD20FD0372DA98C9DCEC59@DM6PR18MB3569.namprd18.prod.outlook.com>
  Created at: Fri, Jan 20, 2023 at 4:50 AM (Delivered after 3 seconds)
  From: hoven patroo <hovenpatrool@hotmail.com>
  To: XXXXXXXXXX
  Subject: 名梦 t94396350
  SPF: PASS with IP 40.92.18.30
  DKIM: 'PASS' with domain hotmail.com
  DMARC: 'PASS'

You would think they’d do some basic Bayesian filtering. This was stuff we fought in 2002.
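For reference, "basic Bayesian filtering" in the 2002 sense is roughly the following: a toy naive Bayes scorer with add-one smoothing. This is a sketch for illustration with made-up names, not what Gmail or Outlook run, and a real filter would need far better tokenization and far more training data.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


class NaiveBayes:
    """Toy word-count spam scorer; labels are "spam" and "ham"."""

    def __init__(self) -> None:
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_log_odds(self, text: str) -> float:
        """Positive means "more likely spam", negative "more likely ham"."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        vocab_size = len(vocab) or 1
        n_spam = sum(self.word_counts["spam"].values())
        n_ham = sum(self.word_counts["ham"].values())
        # Smoothed prior from the number of training documents per class.
        score = math.log((self.doc_counts["spam"] + 1) / (self.doc_counts["ham"] + 1))
        for word in tokenize(text):
            # Add-one smoothing keeps unseen words from zeroing a probability.
            p_spam = (self.word_counts["spam"][word] + 1) / (n_spam + vocab_size)
            p_ham = (self.word_counts["ham"][word] + 1) / (n_ham + vocab_size)
            score += math.log(p_spam / p_ham)
        return score
```

A subject made of random tokens like “名梦 t94396350” gives a scorer like this almost nothing to work with, since every word is unseen in both classes. That may be part of why content-only filtering is not enough when the sender authenticates cleanly.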

The first one is generated by apparent user actions from paid organizations. Although it’s clearly spam, you can see how this is difficult for a provider to tackle, because all of the superficial signals are good: authenticated user, paid account, using official APIs. Obviously they need to step up their defenses against abuses like sharing from docs, calendar, etc to stop bad actors from laundering their spam through Google’s highest-reputation internal senders.

When I worked in this area of gmail we called this the “russian urologist” problem. How do you correctly classify traffic like this when hypothetically some of your customers want to send and receive messages about viagra in russian? Casual observers will say that is spam but not to the russian urologist.

Another anecdotal datapoint, but – I haven’t noticed an uptick in actual spam making it to my primary inbox. I can’t give solid numbers, but it’s not been bad.

This includes a marked increase in crypto spam/phishing emails due to the cointracker email list breach – those have pretty much exclusively gone straight to Spam (including those using Google Sheets so it has an official Google sender email).

Again, just an anecdote, and I don’t doubt that you and anyone else reporting an increase is experiencing it.

I was just about to ask this on here! I regularly check junk mail just in case and it’s been crickets for a long time, but in the last couple months seem to get like 3-4 spam emails in there a day, and regularly into my inbox, usually a Geek Squad or McAfee “purchase” receipt. Very clearly spam.

Spam filtering is a cat and mouse game. The moment you think you have the “perfect” set of rules, scammers will figure out how to game them. Then you’ll have to make changes to handle the additional cases. Rinse and repeat.

I have anecdotally seen slightly more types of scam/phishing messages slip through the filter in recent weeks, but I assume it’ll go away in the next round of updates from Google’s side.

I’ve used Gmail since 2003 and consequently was (un)lucky enough to get my $FIRSTNAME@gmail.com – it’s certainly handy but boy do I get a lot of spam – 300-400 a day I expect.

I’ve definitely noticed an uptick recently and what is most perplexing is that some seem like they’d be easy to catch – in fact, I set up some Gmail filters to do so and they seem to be working 100%.

I can only imagine, mine is a pretty popular name as well, I see quite a few entries where I get mail obviously to other people… It gets kind of annoying to say the least…

Examples like, someone put me on their Farm Equipment account, so I was getting receipts and marketing… Also got on someone’s college application, so funding notifications etc.

I have sometimes made an effort to contact the org or person in question… When I couldn’t reach the person, I did manage to change someone’s password for a dating site, and changed their profile to “I don’t know how email works” etc.

I don’t think there is a way to export filters from Gmail but they’re just a collection of simple rules like:

“Congrats CALLU !” in subject goes to spam
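Written out as data, a rule like that is about as simple as filtering gets. The sketch below is an illustration only: the `Rule` type and `apply_rules` are hypothetical, and plain case-insensitive substring matching is an assumption, not Gmail's actual matching semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    field: str   # which part of the message to inspect, e.g. "subject"
    phrase: str  # text that must appear in that field
    action: str  # where the message goes when the rule matches


RULES = [
    Rule("subject", "Congrats CALLU !", "spam"),
]


def apply_rules(message: dict[str, str], rules: list[Rule] = RULES) -> str:
    """Return the action of the first matching rule, else "inbox"."""
    for rule in rules:
        if rule.phrase.lower() in message.get(rule.field, "").lower():
            return rule.action
    return "inbox"
```

For example, `apply_rules({"subject": "Congrats CALLU ! You won"})` is routed to spam, while an unrelated subject falls through to the inbox.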

Like I alluded to in my post, it feels like these would be easy for Gmail to catch directly.

Yes, I’ve been seeing more in Gmail, but that’s nothing compared to Google Photos spam and Google Calendar spam, which I get hit with every other day.

To maximize the ridiculousness, Google sends me an email thanking me for each image abuse report or chat abuse report done in Photos — but they don’t seem to be actually /doing/ anything about it.

Yes, it’s been measurably worse for somewhere on the order of months to years now.

I’m not sure what they’ve changed internally, because if they have talked about their engineering strategy for spam detection (which I doubt, since it’s probably asymmetric information), no one has shared writings about it.

Nevertheless, I now get obvious spam in my inbox, and important email occasionally goes straight to my spam folder.

People here on HN have been speculating that they moved to some sort of machine learning model, probably because employees were incentivized to pervert the existing product for promotion purposes by gaming internal metrics to prove they’ve had an impact.

Not exactly spam, but quite often mail is badly sorted and promotional mail gets into the main inbox. One of the main offenders is AliExpress. Every day they send mail from various addresses: buyer01.m@mail.aliexpress.com, services01@aliexpress.com, exclusive01@mail.aliexpress.com, ae.like18@mail.aliexpress.com, buyer-info18.m@mail.aliexpress.com.

And every month or so they vary the numbers and I have to tell the filters to route them appropriately to the junk folder. (And I have to do it one mail at a time, because if you select multiple mails with different addresses, the filter doesn’t offer to add them to the filter list.)
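If your mail client lets you script rules, one way around the rotating numbers is to match on the sender's domain, or to blank out the digits before comparing. A small sketch, with made-up helper names, using only the addresses listed above:

```python
import re


def sender_domain(address: str) -> str:
    """Everything after the last "@", lowercased."""
    return address.rsplit("@", 1)[-1].lower()


def is_from_domain(address: str, base_domain: str) -> bool:
    """True for the base domain itself and for any of its subdomains."""
    domain = sender_domain(address)
    return domain == base_domain or domain.endswith("." + base_domain)


def normalize_sender(address: str) -> str:
    """Replace digit runs in the local part so buyer01 and buyer02 compare equal."""
    local, _, domain = address.lower().rpartition("@")
    return re.sub(r"\d+", "#", local) + "@" + domain
```

With this, `buyer01.m@mail.aliexpress.com` and a later `buyer02.m@mail.aliexpress.com` both normalize to `buyer#.m@mail.aliexpress.com`, and all five addresses match `is_from_domain(addr, "aliexpress.com")`.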

Yes. Google has loosened their spam filters. I have noticed.

My educated guess on why? Lawsuits from political parties, notification of class action litigation against Google and others, union notifications, insurance notifications, and similar emails ending up being caught by spam filters.

The lawsuits are piling up.

I think they have a target for “% of good emails filtered as spam” and their classifiers need to choose a lower recall operating point to hit that target, because the spam has gotten harder to detect.
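Here is a toy version of that trade-off, assuming the filter outputs a score per message and flags anything at or above a threshold. The function name and the numbers are invented for illustration. It picks the lowest threshold that keeps the false positive rate on good mail within the target, then reports the recall you are left with.

```python
import math


def pick_threshold(scores: list[float], is_spam: list[bool],
                   max_fpr: float) -> tuple[float, float, float]:
    """Return (threshold, false_positive_rate, recall).

    A message is flagged when score >= threshold. The lowest threshold
    meeting the false-positive target is chosen, which maximizes recall.
    Assumes at least one ham and one spam example.
    """
    ham = [s for s, y in zip(scores, is_spam) if not y]
    spam = [s for s, y in zip(scores, is_spam) if y]
    for threshold in sorted(set(scores)):
        fpr = sum(s >= threshold for s in ham) / len(ham)
        if fpr <= max_fpr:
            recall = sum(s >= threshold for s in spam) / len(spam)
            return threshold, fpr, recall
    # No observed score works: flag nothing at all.
    return math.inf, 0.0, 0.0


if __name__ == "__main__":
    labels = [False] * 4 + [True] * 4
    easy = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]    # classes well separated
    hard = [0.1, 0.2, 0.3, 0.75, 0.35, 0.5, 0.8, 0.9]  # spam overlaps with ham
    print(pick_threshold(easy, labels, max_fpr=0.0))
    print(pick_threshold(hard, labels, max_fpr=0.0))
```

With the same zero-false-positive target, the "easy" scores allow all spam to be caught, while the overlapping "hard" scores force a higher threshold and half the spam gets through. That is the mechanism being described: the target stays fixed and recall is what gives.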

I have a month-old business email for my new company set up with GSuite, and Google’s own on-boarding emails went directly to spam in that inbox. I haven’t marked any emails as spam with this new account yet.

It absolutely is. A few weeks ago I decided to create what are now dozens of rules to manually filter out spam and it has been extremely effective for me (20+ a day were hitting my work inbox).

My best filters target the “opt out / unsubscribe” language people put in their footers. I iterate a few times a week as things sneak in. I’ll never get 100% but the results have been very positive.

There has definitely been some more getting through the filters the past few weeks for me. Maybe January is a special month for spam or something.

The spam arms race continues to escalate. Broad availability of tools like ChatGPT has probably helped spammers in the short term.

If any good can come of this long term, it would be the ability for me to charge people to get an email into my inbox. This has been proposed multiple times over the decades, but has never been more needed or feasible than now.

Anecdotally all the spam I get is exactly the type of spam I got 20 years ago on the same email address. I don’t think it’s chatgpt but either a loosening of filters or something else.

I’m having the opposite problem. Sometimes even my replies to someone with a Gmail address go to their spam box. What kind of a filter decides you don’t want to see a message from someone you messaged first?

FWIW I have my own domain and switched to Google as backend long ago, and yet I still occasionally have this problem.

There are several Google Groups that I subscribe to and this regularly happens:

A real person who I know in real life, whose messages I care about, posts to a Google Group from a Gmail account, and the message ends up in my own Gmail spam folder.

Like – the message didn’t even leave the Google infrastructure and it got tagged as spam?!

I’ve had email messages from Google about Google products for which I have an active account using my Google email address get marked as spam by Google spam filters.

I get 10-20 a day.

Lately it’s been Google classroom invitations from sex bots. Along with the random crap that doesn’t make any sense, and the McAfee/Yeti Cooler junk.

Yeah. I have an account that bounces through gmail as part of a forwarding chain. About 25% of its non-spam messages were being silently stuck there (and not forwarded) until I disabled the gmail spam filter. The downstream account (fastmail) spam filter works fine.

I absolutely had this issue most of last year, but the Gmail spam filter seems to be catching things more effectively for me in the last two months.

Try to treat it like weather. Sometimes things are clear for weeks, then you get hit with storms. My wife and I both have had Gmail accounts forever and we never see the onrushes of spam at the same time. So I think it’s the noise of two algorithms fighting. We should all get used to it.

I noticed this as well. I’m switching to a relatively new service called Tutanota, as I haven’t heard great things about Fastmail and ProtonMail when it comes to spam, and I’m looking for something using open source tooling. We’ll see how it goes.

Has Google ever publicly talked about their spam filter performance over time? For me this past year I get obvious spam messages in my inbox every week. Is it that they can no longer filter at the required scale? It seems hard to believe these messages could evade even the most rudimentary filters, so I assume they’re not being filtered at all.

I’ve experienced kind of the inverse of that lately: I use Workspace (and their domains) for email, and regular outgoing emails are ending up in recipients’ spam boxes. SPF/DKIM/DMARC/etc. are all set up correctly and tested (and have been working fine for many years).

Dear god I thought I was the only one! For a while I was getting multiple Drive spam requests a day. Then they stopped. Yesterday I got my first one again in months and it just is not okay. They are all coming from clearly fake emails and sending Russian bitcoin shit. I know Google has OCR, maybe use it when someone invites 50k people to a drive and there are no social graphs related.

> “Is spam getting harder to filter?”

I look at it a different way: spam and filters are locked in an evolutionary arms race, and at the moment spammers have found an adaptation that gives them an advantage. In due time the anti-spam filters will adapt as well. It has always been a difficult problem.

Yes, it’s getting worse. I get and mark as spam the same email pattern over and over.

TL;DR: almost all the people who care about quality at Google are gone or not in a position to improve the product

I’ll try to address the specific question that seems to have been asked, which is about phishing. Phishing and spam are two different classes. Spam is largely classified based on metadata about the transaction and only to a lesser extent the body of the message. Phishing, on the other hand, is almost purely based on the content, because it revolves around things like whether the message seems to attempt to confuse the recipient about the sender’s identity, includes URLs that appear to be intentionally confusing, or uses domain names that seem to have been intentionally formed to mimic your organization’s domains (for Workspace customers). So you are going to see very different outcomes for spam and for phishing, and quite different outcomes for gmail.com accounts vs. Workspace accounts.

2022 Blinking Robots.