INVESTIGATION OF IMAGE SPAM FINGERPRINTS


Abstract
Image spam is unwanted e-mail in which the text is embedded in an image to fool spam filters. Traditionally, spam filters catch spam by scanning messages for keywords and by using other text-based techniques. However, according to several vendors, approximately 40% of all unwanted e-mail today is image-based spam. It has thus become clear that more advanced spam filtering techniques are needed. One solution is to use optical character recognition (OCR) and fingerprint analysis to catch image-based spam, but such schemes introduce a high overhead when implemented in software. This study explores the image processing techniques required for an image-based spam filter.
For instance, a good starting point is to review the OCR techniques currently implemented in commercially available page scanners. First, an attempt is made to choose one or more software algorithms, to find an optimal solution, and to justify the choices. The implementation and testing of the proposed model will be carried out in a second, independent study.
1. Introduction
E-mail spam, also known as bulk e-mail or junk e-mail, is a subset of spam that involves sending nearly identical messages to numerous recipients by e-mail. Some definitions of spam specifically require that the e-mail be both unsolicited and sent in bulk.
There is no exact definition of spam. Most spam can be described as unwanted e-mail, but not all unwanted e-mail is spam. Another candidate term is unsolicited commercial e-mail, but spam is not only advertising material. Spam can also be defined as junk mail, but that raises the question of what junk mail is. Although most e-mail users know what spam is, it is not obvious how to define spam and spamming. In summary, one could agree that spam is unsolicited, unwanted e-mail that is mostly also advertising material. However, not all unwanted e-mail is spam, and not all spam is advertising. This is not an exact definition of spam; these are only properties that explain the relationship between spam and other sets of e-mail [1].
Spammers have redoubled their efforts to get past anti-spam filters with more ingenuity than ever. Undeterred by spam filters, they can churn out difficult-to-detect image spam messages to a high volume of recipients, increasing their chances of making money. For spammers who are looking to turn a quick profit, a picture can be worth thousands of dollars. To the undiscerning eye, image spam looks just like any other text e-mail. The difference is that image spam is exactly what the name implies: a .jpg or .gif graphic consisting of words embedded in the picture. Consequently, most anti-spam software is unable to detect this type of spam with filters that look at real text. Such filters never see the content because it’s an image, which is effectively invisible to the filter. Cyber-crime rings and other spammers who run legitimate businesses have turned to this technique to stay under the radar as they lure recipients to view pornography, try miracle drugs, sign up for fake degrees, apply for low-interest loans or mortgage offers, and buy penny stocks. To make money, spammers instruct the recipient how to purchase the goods they are offering or how to participate in get-rich-quick schemes. In some cases, a link (URL) to a web site is included in the spam e-mail, so a recipient can click on the link or type the URL into a web browser to reach the spammer’s site [1].
2. The history of spam
The history of spam can be divided into three periods:
1. In the early years, spam letters were addressed and sent manually.
2. Later, spammers used machines, which led to a dramatic rise in the amount of spam.
3. Finally, machine learning was applied to spam filtering and made filtering substantially more effective.
3. Controlling Spam on User Side
How irritating spam is depends on the e-mail habits of the user. For example, if somebody receives 10 letters every week and only one of them is spam, then pressing the delete button is clearly quicker and more comfortable than implementing any kind of filtering. Spam is usually a problem for users who receive hundreds of e-mails per week; pressing the delete button a hundred times every week is really frustrating. In the first years, the main problem was connection speed. Users connecting to the Internet through a modem, in particular, suffered from significantly longer e-mail download times. Waiting ten minutes to download a couple of letters is painful if 70% of them are actually spam. Nowadays, as more and more users have broadband Internet connections, download time plays a smaller role, since a spam message is usually a short text and/or a small picture. Today the problem is that users have to delete the unwanted letters or pick out the important ones from a large amount of junk.
4. Spam Volume
The TRACE team monitors spam volume through its Spam Volume Index (SVI), which tracks the spam received by a representative sample of domains. The index shows that the overall spam volume remains high, although it has leveled off over Q2/2007 compared to the rapid growth experienced in 2H/2006.[3]


Figure 1: Marshal Spam Volume Index(SVI)[3]
4.1. Spam Sources by Country
Where spam comes from provides an interesting insight into how spammers distribute spam. About 70% of all spam originates from a dozen countries, mostly from compromised computers that are part of spam-generating botnets. At the top of the list for the first half of 2007 is the United States, followed by China and Korea [3].
5. Spam Categories
Although there are at least three main kinds of spam, and advertising spam in particular can be divided into many subcategories, all spam also has a content-free characteristic. The majority of spam follows common patterns that can be clearly identified. More than 99% of spam falls into one or more of the categories listed below.
5.1. Advertisement spam
Most spam is commercial advertisement, often a direct product offer. Spam costs the sender very little to send compared to other advertising methods. The most common subcategories of advertisement spam are:
1. Online pharmacy spam: spam promoting different versions of Viagra, Cialis, and antidepressant pills that can be purchased online.
2. Penny stock spam: spam encouraging people to buy cheap stocks.
3. Porn or (sex-) dating spam: porn sites and (sex-) dating sites were often marketed via spam (fortunately, their share of all spam is steadily decreasing).
4. Pirate software spam: spam offering pirated software, usually much cheaper than the official prices.
5. Online casino spam: spam promoting gambling in online casinos.
6. Fake degree spam: spammers often try to sell fake degrees and diplomas.
7. Mule job spam: spam promoting jobs ‘working from home’ (which are typically scams or mule jobs, such as money laundering).
5.2. Financial spam
While advertisement spam offers at least a small probability that the responder gets something for the money sent, financial spam only tries to fool people and get their money, without any chance of buying anything. The most common kinds of financial spam are the following:

1. 419 scams: Usually a plea for help to recover millions of dollars from a bank account in a foreign country (typically Nigeria).
2. Lottery spam: Similar to the 419 scam, these messages claim ‘You have already won X million’ in order to extract transfer fees, etc.
5.3. Phishing
Phishing spam consists of fake alerts from banks (mostly CitiBank), PayPal, eBay, etc. that ask for confirmation, validation, or monitoring of details in order to defraud people of their personal information. Phishing messages usually link to fake login sites, which capture user details (e.g. passwords) that are then used to steal money or goods. The term phishing was coined because the fraudsters are “fishing” for personal information.
6. Image Spam
In November 2006, image spam accounted for up to 40 percent of the total spam received, compared to less than ten percent a year earlier. Image spam has been increasing significantly over the last few months, and various kinds of spam, typically pump-and-dump stocks, pharmacy, and degree spam, are now sent as images rather than text. Image spam is typically three times the size of text-based spam, so this represents a significant increase in the bandwidth used by spam messages.
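The bandwidth effect of these two figures can be estimated with a short calculation. The sketch below assumes the number of spam messages stays the same and normalizes the size of a text spam message to 1; the function name is illustrative only. With a 40 percent image share and a size ratio of three, spam traffic comes out at about 1.8 times the all-text baseline.

```python
def relative_traffic(image_share, size_ratio):
    """Spam traffic relative to an all-text baseline with the same message count.

    image_share: fraction of spam messages that are image spam (0..1)
    size_ratio:  size of an image spam message relative to a text one
    """
    text_part = 1 - image_share            # text messages keep size 1
    image_part = image_share * size_ratio  # image messages are size_ratio times larger
    return text_part + image_part


if __name__ == "__main__":
    # Figures quoted in the text: 40 percent image spam, three times the size.
    print(round(relative_traffic(0.4, 3), 2))
```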
During the second half of 2006 rising volumes of image spam emerged as a significant problem. Email servers everywhere strained to cope with the extra volume and anti-spam filters struggled to maintain detection rates owing to high variability and advanced randomization techniques in the images. Image spam peaked at more than 50% in January 2007 but has since fallen to less than 20% - most likely as a result of improved image spam detection. Notable over the period was a reduction in stock image spam, whereas health image spam continued largely unabated.[3]

Figure 2: Image Spam June 2006 to June 2007[3]

6.1. Image spam takes its toll on productivity
Image spam not only puts recipients at financial and personal risk, it also bogs down email servers and inhibits productivity. Image spam is typically three to four times the file size of text-based spam, so more server space is needed to store the messages, and bandwidth is reduced. Even if image spam is detected and then sent to a quarantine database, which is typically fixed in size, there’s the danger that these unwanted messages will clog up the server until they are deleted. For most organizations, it’s preferable to use an anti-spam product that can recognize the messages as spam and drop them at the mail gateway—before they get to the servers[5].
7. Image Spam Is Difficult To Detect
Image spam has been around for years. It was originally created in order to get past "heuristic" filters, which block messages containing words and phrases commonly found in spam. Since image files are in an entirely different format than the text found in an email, heuristic filters never "see" the content of the message. Therefore, these filters were easily defeated by this type of spam[5].
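A minimal sketch of why a word-based filter misses image spam is given here. It uses Python's standard email package; the word list, the message texts, and the placeholder image bytes are all made up for illustration and do not come from any real filter. The filter only inspects text parts, so a message whose pitch lives in the attached image produces no hits.

```python
import re
from email.message import EmailMessage

# Hypothetical word list; real heuristic filters use far larger rule sets.
SPAM_WORDS = {"viagra", "stock", "loan", "casino"}


def keyword_hits(msg):
    """Return the spam words found in the text parts of a message."""
    hits = set()
    for part in msg.walk():
        # Only text/* parts are readable; image parts are skipped entirely.
        if part.get_content_maintype() == "text":
            words = set(re.findall(r"[a-z]+", part.get_content().lower()))
            hits |= words & SPAM_WORDS
    return hits


def build_text_spam():
    msg = EmailMessage()
    msg["Subject"] = "Hot tip"
    msg.set_content("Buy this penny stock today, no loan needed.")
    return msg


def build_image_spam():
    msg = EmailMessage()
    msg["Subject"] = "Re: meeting notes"
    msg.set_content("Thanks for the notes, see attached.")
    # Placeholder bytes standing in for a GIF that displays the stock pitch.
    fake_gif = b"GIF89a" + bytes(64)
    msg.add_attachment(fake_gif, maintype="image", subtype="gif",
                       filename="notes.gif")
    return msg


if __name__ == "__main__":
    print(sorted(keyword_hits(build_text_spam())))
    print(sorted(keyword_hits(build_image_spam())))
```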
To deal with this problem, anti-spam vendors developed "fuzzy signature" technologies. These signature-based technologies collect samples of known spam and then classify "near-identical" messages as spam. These signatures were sometimes written against just the message attachment, so that messages with different content but the same attachment would still be marked as spam.
Signature-based defenses remained effective for several years. In 2006, however, spammers began randomizing images to appear the same to the human viewer but totally different to spam filters. For example, some spammers are sending messages advertising the purchase of stocks with an attached .gif file that has random "dots" inserted in the image and borders with subtly different color and width. The signatures that most anti-spam vendors rely on to detect these attacks vary dramatically, based on these small changes to the image. This means that anti-spam vendors may publish a rule that stops one instance, but this rule doesn't stop all the rest of the spam messages in the attack[5].
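The effect of a single random dot on an exact signature can be illustrated as follows. This is a sketch under simplifying assumptions: the image is a toy 8x8 grayscale grid held in a Python list, and SHA-256 over the raw pixel values stands in for whatever signature a real product computes. Changing one pixel out of 64 yields a completely different signature.

```python
import hashlib


def exact_signature(pixels):
    """SHA-256 over the raw grayscale values (0..255) of a 2D pixel grid."""
    flat = bytes(value for row in pixels for value in row)
    return hashlib.sha256(flat).hexdigest()


def with_dot(pixels, x, y, value=0):
    """Return a copy of the image with one pixel overwritten (a 'dot')."""
    copy = [list(row) for row in pixels]
    copy[y][x] = value
    return copy


def changed_pixels(a, b):
    """Count the positions at which two equally sized images differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


if __name__ == "__main__":
    original = [[255] * 8 for _ in range(8)]  # toy all-white image
    variant = with_dot(original, x=5, y=2)
    print(changed_pixels(original, variant))
    print(exact_signature(original) == exact_signature(variant))
```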
There is an almost infinite number of ways that spammers can randomize images. In addition to inserting dots, spammers have recently used techniques such as varying the colors used in an image, changing the width and pattern of the border, altering the font style, and "slicing" images into smaller pieces (which are then reassembled to appear as a single image to the recipient) [5].
Detecting image spam is a challenge for two main reasons:
1. Image spam messages are often organized in the same way as legitimate email messages.
2. Spammers randomly modify images to escape spam filters.
7.1 Imitating legitimate email
In some cases, spammers try to fool anti-spam filters by organizing an image spam message to look like a legitimate message from popular email programs such as Microsoft® Outlook®, Outlook Express, or Thunderbird. When you prepare an email, you might type in your text and attach an image. Spammers do exactly the same thing when they prepare their messages. They include headers, subject lines, and complex text, just like any standard communication. But the subject lines are often nonsense constructed from random words that are unrelated to the content of the spam. Sometimes, image spam contains random body text in order to avoid detection, fake conversation threads to make the spam look like it was a reply to a previous mail, and passages from popular literature and random chunks of text in an attempt to bypass anti-spam text scanning[5].
7.2. Random changes
Resourceful image spammers resort to a varied palette of image modification techniques to evade spam filters, optical character recognition (OCR), and image scanning techniques that compare new image spam with known spam images. In addition to adding random noise to images, spammers often obfuscate text using various techniques or split an image into multiple smaller “tiled” images to make the entire image look like a jigsaw puzzle. A second flood of image spam may look exactly like its predecessor to the human eye, but the properties of the new image are completely different[5].
9. Why OCR fails
Deliberate obfuscation makes it more difficult to determine which pixels are text and which are background color or noise. Spammers often obfuscate the text by using raised lettering, different font sizes, and random colors that can make it hard even for humans to read.
The text in the preceding example is obfuscated. The human eye can see that the fourth word is supposed to be “BUY,” but character recognition may process the “U” as an “L” and a “J” due to the split in the middle. The “Y” may also be processed incorrectly, as it could be misread as an “X.” The background image is also obfuscated to make OCR scanning even more difficult, but the content is still readable by recipients. Spammers have also resorted to creating images with handwritten messages, which are almost impossible for OCR to read because it relies on font recognition.
Because many image spam messages consist of text, it seems that OCR would be a logical choice for detection. But OCR is extremely time consuming, CPU intensive, and very easy to deceive. OCR isn’t 100 percent accurate even when scanning black-and-white text, and it is very inaccurate when spammers deliberately obfuscate the text. The more the text is disguised, the slower the processing of each image. This places an extraordinary cost and undue burden on valuable computing resources that, for some organizations, may already be operating at close to capacity [5].
10. Fingerprinting
Although there is considerable agreement in industry and the academic community that Bayesian content filtering systems offer the best accuracy in classifying text-based messages, in the early 2000s many large operators adopted a second broad class of antispam technologies. As with classical OCR in the processing of image e-mail messages, this shift to fingerprinting was partly motivated by the computational requirements of the first generation of Bayesian filters. Fingerprint or checksum-based filters exploit the fact that spam messages are sent in bulk. These antispam technologies essentially work by stripping all content that may vary across messages and reducing what remains to a checksum, or fingerprint, that identifies that particular message within the population of all possible messages. Then, before allowing the message to pass through to the end user, the system compares the checksum with those collected in a centralized database of fingerprints. Hence, there is generally no analysis of the content of the messages [6].
There are different methods for constructing such a centralized database in real time. Some commercial software products, for example, provide feedback buttons in their e-mail clients that users can select to nominate a particular message as spam. Within the system design, this nomination increases the probability that similar messages are classified as spam within the centralized database. More advanced checksum filters used by ISPs today employ fuzzy fingerprinting techniques, which are able to identify outbreaks that use the new randomization and automation techniques developed by spammers to bypass simple fingerprint systems.
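The checksum scheme described above can be sketched in a few lines. Everything specific here is an assumption made for illustration: the normalization rules (lowercasing, dropping digits and punctuation), the use of SHA-256, the class name FingerprintDatabase, and the report threshold of two are not taken from any particular product.

```python
import hashlib
import re
from collections import Counter


def fingerprint(body):
    """Strip content that tends to vary between copies, then hash the rest."""
    text = body.lower()
    text = re.sub(r"\d+", "", text)               # drop numbers such as user IDs
    text = re.sub(r"[^a-z]+", " ", text).strip()  # collapse punctuation and spaces
    return hashlib.sha256(text.encode("ascii")).hexdigest()


class FingerprintDatabase:
    """Toy stand-in for a centralized database fed by user spam reports."""

    def __init__(self, threshold=2):
        self.reports = Counter()
        self.threshold = threshold

    def report(self, body):
        # A user pressed the hypothetical "this is spam" feedback button.
        self.reports[fingerprint(body)] += 1

    def is_spam(self, body):
        return self.reports[fingerprint(body)] >= self.threshold


if __name__ == "__main__":
    db = FingerprintDatabase(threshold=2)
    db.report("Dear user 4821, BUY cheap meds now!!!")
    db.report("dear user 77,  buy cheap MEDS now")
    print(db.is_spam("Dear user 9, buy cheap meds now!"))
    print(db.is_spam("Lunch at noon tomorrow?"))
```

Note that the message content is never interpreted: two copies match only because they reduce to the same checksum after the variable parts are stripped.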
10.1 Smart spam
Since checksum systems identify spam by monitoring the quantities of similar messages sent in real time, spammers have learned to deploy spam-sending robots that automatically change the shape of individual spam e-mails. As directed by semi-autonomous robots, the mass mailings that contain the so-called hashbusters look different to the fingerprint systems currently in place but convey the same message to the eyes of the end users. The capacity of spammers to easily send messages that appear unique hence leads to very low spam detection rates in basic fingerprint systems and has prompted the adoption of fuzzy fingerprinting methods. Image spam is one particular form of smart spam [6].
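One simple way to make a fingerprint tolerant of small changes is to hash coarse image structure instead of exact bytes. The sketch below uses an average-hash style scheme chosen purely for illustration; it is not claimed to be the fuzzy fingerprinting method used by any vendor or ISP. It assumes a square grayscale image, held as a list of rows, whose side is divisible by the grid size. Similar images give fingerprints with a small Hamming distance, and the distance threshold would have to be tuned in practice.

```python
def block_means(pixels, grid=4):
    """Average brightness of each cell in a grid x grid partition of the image."""
    size = len(pixels) // grid
    means = []
    for by in range(grid):
        for bx in range(grid):
            block = [pixels[y][x]
                     for y in range(by * size, (by + 1) * size)
                     for x in range(bx * size, (bx + 1) * size)]
            means.append(sum(block) / len(block))
    return means


def fuzzy_fingerprint(pixels, grid=4):
    """One bit per cell: 1 if the cell is brighter than the average cell."""
    means = block_means(pixels, grid)
    threshold = sum(means) / len(means)
    return [1 if m > threshold else 0 for m in means]


def hamming(a, b):
    """Number of positions at which two fingerprints differ."""
    return sum(x != y for x, y in zip(a, b))


def left_right_image():
    """Toy 8x8 image: dark left half, bright right half."""
    return [[20] * 4 + [230] * 4 for _ in range(8)]


def top_bottom_image():
    """Toy 8x8 image: dark top half, bright bottom half."""
    return [[20] * 8 for _ in range(4)] + [[230] * 8 for _ in range(4)]


if __name__ == "__main__":
    original = left_right_image()
    dotted = [list(row) for row in original]
    dotted[1][6] = 0  # a single "hashbuster" dot in the bright area
    fp = fuzzy_fingerprint(original)
    print(hamming(fp, fuzzy_fingerprint(dotted)))
    print(hamming(fp, fuzzy_fingerprint(top_bottom_image())))
```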
11. Proposed Work
The goal of the whole study is to build an effective high-level conceptual model for the identification and removal of image spam. Among all the existing solutions, my view is that a neural network should be used to handle it. A further challenge for an artificial neural network (ANN) is its training: my proposed model will be trained in a supervised manner with the help of an offline genetic algorithm. Once it has been trained properly, the prototype model should handle the problem more effectively and in a fail-safe way.
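As a rough illustration of what supervised training of a network by a genetic algorithm can look like, the sketch below evolves the weights of a single neuron on a handful of labeled samples. It is not the proposed model: the two image features, the sample values, the population size, and the mutation settings are all hypothetical, and a real system would need a larger network and real image features.

```python
import random

# Hypothetical labeled samples: (feature_1, feature_2) -> 1 for spam, 0 for legitimate.
# The features are placeholders, e.g. a text-density score and a noise score in 0..1.
SAMPLES = [
    ((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.7, 0.7), 1),
    ((0.2, 0.1), 0), ((0.1, 0.3), 0), ((0.3, 0.2), 0),
]


def predict(weights, features):
    """Single neuron with a step activation; weights = [w1, w2, bias]."""
    total = weights[0] * features[0] + weights[1] * features[1] + weights[2]
    return 1 if total > 0 else 0


def fitness(weights):
    """Number of labeled samples classified correctly (supervised signal)."""
    return sum(predict(weights, f) == label for f, label in SAMPLES)


def evolve(generations=30, pop_size=20, seed=0):
    """Evolve neuron weights; returns the best weights and the fitness history."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        elite = population[: pop_size // 2]  # the best half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child[rng.randrange(3)] += rng.gauss(0, 0.3)      # mutate one gene
            children.append(child)
        population = elite + children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


if __name__ == "__main__":
    best, history = evolve()
    print(history[0], history[-1])
```

Because the best half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next.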
12. Example of image spam


13. References
[1] Ending Spam - Bayesian Content Filtering and the Art of Statistical Language Classification by Jonathan A. Zdziarski
[2] A plan for spam by Paul Graham, http://www.paulgraham.com/spam.html
[3] Marshal Security Trends Spam, Phishing, Malware by Marshal Threat Research & Content Engineering Team
[4] Image Spam: The Email Epidemic of 2006, http://www.ironport.com/technology/ironport_image_spam.html
[5] Image Spam: The New Email Scourge by Nick Kely
[6] High Speed Image Part Recognitions by Partrik Ostrihon and Reza Rajabium


  Malik Muhammad Asif
SZABIST Karachi, Pakistan
 
 
   
 
 
 
StudiesInn.Com © 2008
All Rights Reserved.