
Introduction to Regular Expressions (Regex)

This next subject might seem a bit obscure, but I promise you, this guide will benefit you significantly if you work in IT/Security long enough. I’m going to be discussing what is usually referred to as a regular expression, or regex. Regex is a very powerful way to search through massive text files for exactly what you need. The problem is, it looks like gibberish until you know exactly how it works.

Manipulating Text in Linux

Remember, nearly everything in Linux is a file, and for that matter, most are simple text files. Unlike Windows, with elaborate snap-ins and MMCs to configure an application or server, Linux simply has a text file for configuration. Change the text file, change the configuration. As a result, early pioneers in Linux developed some rather elaborate and elegant ways to manipulate text.

For instance, what if we were looking for a line of code among millions of lines of code that began with an “s”, contained only the letters “sugr” and the numbers 1-5, and ended with “bb”? Could we find it without having to go through millions of lines of code? Using regex, we can do exactly that.
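As a minimal sketch of that search in Python, here is one possible reading of the requirement: the line starts with "s", ends with "bb", and everything in between is drawn from the letters s, u, g, r and the digits 1 to 5. The sample lines and the name PATTERN are made up for illustration.

```python
import re

# One reading of the requirement (an assumption, not the only possible one):
# starts with "s", ends with "bb", and only s/u/g/r or 1-5 in between.
PATTERN = re.compile(r"^s[sugr1-5]*bb$")

# Made-up sample lines standing in for "millions of lines of code".
sample_lines = [
    "sugr12345bb",  # fits the description
    "s3bb",         # fits the description
    "sugar1bb",     # contains "a", which is not allowed
    "sugr6bb",      # 6 is outside the range 1-5
    "xsugr1bb",     # does not start with "s"
    "sugr1bbz",     # does not end with "bb"
]

matches = [line for line in sample_lines if PATTERN.search(line)]
print(matches)
```

The same pattern could be handed to a command-line tool such as grep; Python is used here only to keep the example self-contained.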

The Importance of Learning Regex

Regex is implemented throughout the IT world. First developed in 1956, it has now found its way into Java, Ruby, PHP, Perl, Python, MySQL, Apache, .NET, and of course, Linux.

Without understanding regex, you’re not only hamstrung in scripting any of these languages, but anything beyond a simple search and replace becomes very tedious. In addition, many of the rules written into Snort and other intrusion detection systems are written in regex.

As you can imagine, when you are hunting for malicious code, the ability to find sophisticated and complex text patterns is crucial.

How Regex Works in a Security Environment

Step 1 – A Snort Rule

Snort is one of the many applications and scripting languages that use regular expressions. With its ability to detect just about any type of attack, Snort would be crippled without its regex capabilities. Let’s take a look at a rule to detect the Ransomware attacks that were seen across the world:

alert tcp $HOME_NET any -> $EXTERNAL_NET $HTTP_PORTS (msg:"MALWARE-CNC Win.Ransomware.PRISM outbound connection attempt - Get lock screen"; flow:to_server,established; content:"GET"; http_method; content:"/page/index_htm_files2/"; nocase; fast_pattern:only;

pcre:"/\x2f((xr)_a-z)|[0-9]{3,}\x2e(css|js|jpg|png|txt)$/U";

http_uri; metadata:impact_flag red, policy balanced-ips drop, policy security-ips drop, ruleset community, service http; reference:url,http://www.virustotal.com/en/file/417cb84f48d20120b92530c489e9c3ee9a9deab53fddc0dc153f1034d3c52c58/analysis/1377785686/; classtype:trojan-activity; sid:1000033; rev:3;)

What we are interested in right now is the pcre option, which sits on its own line in the rule above. This is the part of the rule that uses PCRE to detect the ransomware.

We’ll come back to this rule in a later guide, but for now, let’s look at a simpler Snort rule using regular expressions. If you look closely, you will probably be able to identify some similarities between the simple rule below and the Snort rule above.

For our example, let’s use the following pseudo-rule:

alert tcp any any -> any 80 ( pcre:"/\/foo.php?id=[0-9]{1,10}/";)

The first part of this rule should be familiar to anyone who has worked with firewalls or intrusion detection systems. It says “send an alert when a packet comes across the wire using the TCP protocol from any IP address and any port to any IP address on port 80.” It’s what comes after this that is new and strange.
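To make the header fields concrete, here is a small Python sketch that splits the pseudo-rule's header into the pieces just described. The field names are labels chosen for illustration, and a whitespace split is far simpler than whatever Snort's own parser does.

```python
# Header of the pseudo-rule, without the options in parentheses.
header = "alert tcp any any -> any 80"

# Illustrative field names (assumptions), assigned in the order the text reads them.
action, protocol, src_ip, src_port, direction, dst_ip, dst_port = header.split()

rule_header = {
    "action": action,        # what to do on a match
    "protocol": protocol,    # which protocol to inspect
    "src_ip": src_ip,        # source address ("any" = no restriction)
    "src_port": src_port,    # source port
    "direction": direction,  # traffic flows from source to destination
    "dst_ip": dst_ip,        # destination address
    "dst_port": dst_port,    # destination port
}

for name, value in rule_header.items():
    print(f"{name:10} {value}")
```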

Our task now is to figure out what this rule is looking for.

Step 2 – Some Basic Syntax

Before we begin to attempt to decipher what that rule is looking for, let’s lay out basic and simple regular expression syntax and rules.

  • / – Begins and ends a regular expression.
  • . – Matches any single character.
  • [ ] – Matches any single character listed within the brackets.
  • [^ ] – Matches any single character except those listed between the brackets (after the ^).
  • [x-y] – Matches any single character in the range from x to y (ex: [a-d] will match the letters a, b, c, or d, and [2-7] will match the digits 2, 3, 4, 5, 6, or 7). Ranges are case sensitive by default and can be combined however you like. For example, to match any alphanumeric character, you can use [A-Za-z0-9].
  • ^ – Matches the starting position of the string.
  • * – Matches the preceding element or group zero or more times.
  • $ – Matches the ending position of the string.
  • ( ) – Defines an expression or group.
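  • {n} – Matches the preceding element exactly n times (ex: {5} would require the preceding character or group to match 5 times).
  • {m,n} – Matches the preceding element at least m times and not more than n times (ex: {2,4} would require the preceding character or group to appear 2-4 times in a row).
  • | – Matches the character or group either before OR after the |.
  • ? – Matches the preceding element zero or one time (it makes the element optional).
  • \ – Escapes the following character so that it is treated literally (ex: \. matches an actual period).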
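The list above can be tried out one item at a time. The sketch below uses Python's re module rather than PCRE itself; these basic constructs behave the same way in both. The patterns and test strings are made up for illustration.

```python
import re

# (pattern, text, should_match) triples, roughly one per syntax element above.
examples = [
    (r"c.t", "cut", True),          # . matches any single character
    (r"[abc]", "b", True),          # [ ] matches one character from the set
    (r"[^abc]", "b", False),        # [^ ] matches one character NOT in the set
    (r"[2-7]", "5", True),          # [x-y] matches one character in the range
    (r"^foo", "foobar", True),      # ^ anchors at the start of the string
    (r"bar$", "foobar", True),      # $ anchors at the end of the string
    (r"ab*c", "ac", True),          # * allows zero repetitions of "b"
    (r"^(ab){2}$", "abab", True),   # ( ) groups, {n} repeats exactly n times
    (r"^a{2,4}$", "aaaaa", False),  # {m,n} bounds the repetition count
    (r"^(cat|dog)$", "dog", True),  # | matches either alternative
]

results = [bool(re.search(pattern, text)) for pattern, text, _ in examples]

for (pattern, text, expected), got in zip(examples, results):
    print(f"{pattern!r:16} on {text!r:10} -> {got} (expected {expected})")
```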

Step 3 – Interpreting the Rule

That summarizes some of the very basic rules of regular expressions. Let’s try breaking down the regular expression built into the simple Snort rule above and try to determine what it is looking for.

pcre:"/\/foo.php?id=[0-9]{1,10}/";

  • pcre: – This simply tells the Snort engine to start using Perl Compatible Regular Expressions on everything that follows.
  • " – Indicates the beginning of the content.
  • / – Indicates the beginning of the expression that PCRE is looking for.
  • \ – This is an escape character. It says “don’t use the special meaning that the following character has in pcre,” but instead treat it as a literal character. This is necessary because the next character is “/”, which already has a meaning here: it marks the start and end of the expression.
  • /foo.php?id= – This is meant as simple text: the rule is looking for this set of characters. Strictly speaking, “.” and “?” are also special in regex (“.” matches any single character and “?” makes the preceding character optional), so to match them literally they need the same backslash treatment: \/foo\.php\?id=. The rest of this walkthrough reads the text literally, as intended.
  • [0-9] – The brackets here say to look for any single digit between 0 and 9.
  • {1,10} – The curly braces here say to repeat the preceding element (the digit) between 1 and 10 times.
  • / – Ends the expression we are searching for.
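The breakdown above can be written out as a commented pattern. This is a sketch in Python's re module, not Snort's engine, and it assumes the intent is to match "." and "?" literally, so both are escaped. The names URI_PATTERN and UNESCAPED and the sample request line are made up for illustration.

```python
import re

# Commented (verbose) form of the expression, with "." and "?" escaped.
URI_PATTERN = re.compile(
    r"""
    /foo         # literal text; Python has no "/" delimiters, so no escape needed
    \.php        # escaped dot, so it matches an actual "."
    \?id=        # escaped question mark, so it matches an actual "?", then id=
    [0-9]{1,10}  # between 1 and 10 digits
    """,
    re.VERBOSE,
)

# The expression exactly as written in the pseudo-rule (minus the delimiters).
UNESCAPED = re.compile(r"/foo.php?id=[0-9]{1,10}")

request_line = "GET /foo.php?id=42 HTTP/1.1"

print(URI_PATTERN.search(request_line).group(0))  # the matched part of the URI
print(UNESCAPED.search(request_line))             # None: the literal "?" is never consumed
print(bool(UNESCAPED.search("/fooXphid=7")))      # True: "." is a wildcard, "p?" is optional
```

The second and third results show why the escapes matter: without them the expression accepts text the author never intended and misses the URL it was written for.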

We could then interpret this rule in English as, “look for text (presumably a URL) that contains ‘/foo.php?id=’ followed by a digit between 0 and 9, repeated between 1 and 10 times.”

This rule would then catch packets with:

  • foo.php?id=1
  • foo.php?id=3
  • foo.php?id=33
  • foo.php?id=333333

But would pass packets with:

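  • bar.php?id=1 (bar instead of foo)
  • foo.php?id= (must have at least one digit)
  • foo.php?id=A (must have a digit, not a letter)

Note that foo.php?id=11111111111 (11 digits) would still be caught: nothing in the expression anchors the end of the match, so the first 10 digits are enough to satisfy {1,10}. Rejecting longer numbers requires saying what must come after the digits, for example $ for the end of the string.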
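A quick way to check these lists is to run them through Python's re module. Note what happens with the 11-digit value: the pattern is unanchored, so its first ten digits are enough for a match. The sketch assumes "." and "?" are escaped so they match literally, and STRICT is a hypothetical tightened variant, not part of the original rule.

```python
import re

# Assumption: "." and "?" are escaped so that they match literally.
RULE = re.compile(r"/foo\.php\?id=[0-9]{1,10}")

# Hypothetical stricter variant: the negative lookahead refuses an 11th digit.
STRICT = re.compile(r"/foo\.php\?id=[0-9]{1,10}(?![0-9])")

uris = [
    "/foo.php?id=1",
    "/foo.php?id=333333",
    "/bar.php?id=1",
    "/foo.php?id=",
    "/foo.php?id=A",
    "/foo.php?id=11111111111",
]

# True means the pattern found a match, i.e. the rule would alert.
alerts = {uri: bool(RULE.search(uri)) for uri in uris}
strict_alerts = {uri: bool(STRICT.search(uri)) for uri in uris}

for uri in uris:
    print(f"{uri:26} rule={alerts[uri]!s:6} strict={strict_alerts[uri]}")
```

The lookahead is only one way to cap the length; anchoring with $ would work as well when the digits are expected at the very end of the inspected text.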

These are the basics of Regex. I’ll be writing a more advanced guide later, but this should be enough to get you rolling for now. Regex is honestly something that you just have to work with over and over again to get the hang of. Given time, something as incomprehensible as the advanced rule at the start of this guide will be as easy for you to read as the rest of this page.
