Manipulating Text in Linux
Remember, nearly everything in Linux is a file, and for that matter, most are simple text files. Unlike Windows, with elaborate snap-ins and MMCs to configure an application or server, Linux simply has a text file for configuration. Change the text file, change the configuration. As a result, early pioneers in Linux developed some rather elaborate and elegant ways to manipulate text.
For instance, what if we were looking for a line of code among millions of lines that began with an “s”, contained only the letters “sugr” and the numbers 1-5, and ended with “bb”? Could we find it without having to go through millions of lines of code? Using regex we can do exactly that.
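One reading of that description can be sketched with Python's re module. The sample lines here are made up for illustration, and the exact pattern depends on how strictly you read "containing only":

```python
import re

# Starts with "s", continues with any mix of the letters s, u, g, r and the
# digits 1-5, and ends with "bb". This is one interpretation of the description.
pattern = re.compile(r"^s[sugr1-5]*bb$")

# Hypothetical sample lines, standing in for a much larger file.
lines = ["sugr15bb", "s123bb", "sugr6bb", "xsugrbb", "sbb", "sugrbbx"]

# Keep only the lines that satisfy the pattern.
matches = [line for line in lines if pattern.search(line)]
print(matches)
```

The same pattern could be handed to a command-line tool such as grep; Python is used here only so the result is easy to inspect.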
The Importance of Learning Regex
Regex is implemented throughout the IT world. First developed in 1956, it has now found its way into Java, Ruby, PHP, Perl, Python, MySQL, Apache, .NET, and of course, Linux.
Without understanding regex, you’re not only hamstrung in scripting with any of these languages, but anything beyond a simple search and replace becomes very tedious. In addition, many of the rules in Snort and other intrusion detection systems are written using regex.
As you can imagine, when you are searching for malicious code, the ability to find sophisticated and complex text patterns is crucial.
How Regex Works in a Security Environment
Step 1 – A Snort Rule
Snort is one of the many applications that use regular expressions. With its ability to detect just about any type of attack, Snort would be crippled without its regex capabilities. Let’s take a look at a rule written to detect one of the ransomware attacks that were seen across the world:
alert tcp $HOME_NET any -> $EXTERNAL_NET $HTTP_PORTS (msg:"MALWARE-CNC Win.Ransomware.PRISM outbound connection attempt - Get lock screen"; flow:to_server,established; content:"GET"; http_method; content:"/page/index_htm_files2/"; nocase; fast_pattern:only;
pcre:"/\x2f((xr)_a-z)|[0-9]{3,}\x2e(css|js|jpg|png|txt)$/U";
http_uri; metadata:impact_flag red, policy balanced-ips drop, policy security-ips drop, ruleset community, service http; reference:url,http://www.virustotal.com/en/file/417cb84f48d20120b92530c489e9c3ee9a9deab53fddc0dc153f1034d3c52c58/analysis/1377785686/; classtype:trojan-activity; sid:1000033; rev:3;)
The first part of this rule should be familiar to anyone who has worked with firewalls or intrusion detection systems. It says “send an alert when a packet comes across the wire using the TCP protocol from any port on an address in $HOME_NET to one of the HTTP ports (such as port 80) on an address in $EXTERNAL_NET”. It’s what comes after this that is new and strange.
Our task now is to figure out what this rule is looking for.
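As a rough illustration of how the header (everything before the opening parenthesis) breaks down, here is a minimal Python sketch. The function name parse_header and the field names are made up for this example, and the rule options are abbreviated to a placeholder message:

```python
def parse_header(rule: str) -> dict:
    """Split the part of a Snort rule before "(" into named fields."""
    header = rule.split("(", 1)[0].split()
    # Field names are illustrative labels, not official Snort terminology.
    keys = ["action", "protocol", "src_ip", "src_port",
            "direction", "dst_ip", "dst_port"]
    return dict(zip(keys, header))


# Same header as the rule above; the options are shortened for the example.
rule = 'alert tcp $HOME_NET any -> $EXTERNAL_NET $HTTP_PORTS (msg:"example"; sid:1000033; rev:3;)'

fields = parse_header(rule)
for name, value in fields.items():
    print(f"{name:10} {value}")
```

Reading the fields left to right gives the sentence above: the action, the protocol, the source address and port, the direction, and the destination address and port.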
Step 2 – Some Basic Syntax
Before we attempt to decipher what that rule is looking for, let’s lay out some basic regular expression syntax.
- / – Begins and ends a regular expression.
- . – Matches any single character.
- [ ] – Matches a single character within the brackets.
- [^ ] – Matches any single character except those listed between the brackets (after the ^).
- [x-y] – Matches any single character in the range from x to y (ex: [a-d] will match the letters a, b, c, or d and [2-7] will match the digits 2, 3, 4, 5, 6, or 7). Ranges are case sensitive by default and can be combined however you like. For example, to match any alphanumeric character, you can use [A-Za-z0-9].
- ^ – Matches the starting position of the string.
- * – Matches the preceding element or group zero or more times.
- $ – Matches the ending position of the string.
- ( ) – Defines an expression or group.
- {n} – Matches the preceding element exactly n times (ex: {5} would require the preceding character or group to match 5 times in a row).
- {m,n} – Matches the preceding element at least m times and not more than n times (ex: {2,4} would require the preceding character or group to appear 2-4 times in a row).
- | – Matches the character or group either before OR after the |.
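The syntax above can be tried out quickly with Python's re module. This is only a sketch: the sample strings are invented, Python does not use the / delimiters, and re.search looks for a match anywhere in the string unless the pattern is anchored with ^ or $.

```python
import re

# Each entry: (pattern, text, whether a match is expected).
examples = [
    (r"c.t", "cut", True),            # . matches any single character
    (r"[abc]at", "bat", True),        # [ ] matches one character from the set
    (r"[^abc]at", "bat", False),      # [^ ] excludes the listed characters
    (r"[^abc]at", "rat", True),
    (r"[2-7]", "9", False),           # 9 is outside the range 2-7
    (r"^foo", "foobar", True),        # ^ anchors to the start
    (r"^foo", "barfoo", False),
    (r"bar$", "foobar", True),        # $ anchors to the end
    (r"ab*c", "ac", True),            # * allows zero occurrences of b
    (r"a{3}", "aa", False),           # {3} needs three in a row
    (r"^a{2,4}$", "aaaaa", False),    # five is more than the maximum of four
    (r"^(cat|dog)$", "dog", True),    # | chooses between alternatives in the group
]

results = [bool(re.search(pattern, text)) for pattern, text, _ in examples]

for (pattern, text, expected), got in zip(examples, results):
    print(f"{pattern!r:16} on {text!r:10} -> {got} (expected {expected})")
```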
Step 3 – Interpreting the Rule
That summarizes some of the very basic rules of regular expressions. Let’s start with the regular expression from a much simpler Snort rule than the one above and try to determine what it is looking for.
pcre:"/\/foo\.php\?id=[0-9]{1,10}/";
- pcre: – This simply tells the Snort engine to start using Perl Compatible Regular Expressions on everything that follows.
- " – Indicates the beginning of the content.
- / – Indicates the beginning of the expression that the PCRE is looking for.
- \ – This is an escape character. It says “don’t use the special meaning that the following character has in PCRE,” but instead treat it as a literal character. The first one is necessary because the next part of the rule starts with “/”, which already marks the beginning and end of the expression. The “.” and “?” are escaped for the same reason: left unescaped, “.” would match any single character and “?” would make the preceding character optional.
- /foo.php?id= – With the escapes applied, this is simple text. The rule is looking for exactly this set of characters.
- [0-9] – The brackets here say to look for any single digit between 0 and 9.
- {1,10} – The curly braces here say to look for that digit between 1 and 10 times in a row.
- / – Ends the expression we are searching for.
We could then interpret this rule in English to say, “look for something (presumably a URL) that contains ‘/foo.php?id=’ followed by a digit between 0 and 9, where that digit can be repeated between 1 and 10 times.”
This rule would then catch packets with:
- foo.php?id=1
- foo.php?id=3
- foo.php?id=33
- foo.php?id=333333
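But would pass packets with:
- bar.php?id=1 (bar instead of foo)
- foo.php?id= (must have at least one digit)
- foo.php?id=A (must have a digit, not a letter)
One caveat: foo.php?id=11111111111 would still be caught. Nothing anchors the end of the expression, so the first ten digits satisfy {1,10} and the eleventh is simply ignored. To reject an eleven-digit value, the expression would need something after the digits, such as the $ anchor.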
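These cases can be checked with a short Python sketch. It assumes the dot and question mark are escaped so that they match literally, and it drops the / delimiters because Python does not use them. The strict variant with a trailing $ is an addition for comparison, not part of the rule.

```python
import re

# Unanchored, as in the rule: matches anywhere in the URI.
pattern = re.compile(r"/foo\.php\?id=[0-9]{1,10}")
# Hypothetical stricter variant: the digits must run to the end of the URI.
strict = re.compile(r"/foo\.php\?id=[0-9]{1,10}$")

uris = [
    "/foo.php?id=1",
    "/foo.php?id=333333",
    "/bar.php?id=1",
    "/foo.php?id=",
    "/foo.php?id=A",
    "/foo.php?id=11111111111",
]

results = {uri: bool(pattern.search(uri)) for uri in uris}
strict_results = {uri: bool(strict.search(uri)) for uri in uris}

for uri in uris:
    print(f"{uri:26} rule={results[uri]!s:6} strict={strict_results[uri]}")
```

The two patterns differ only on the eleven-digit value: the unanchored search accepts it because its first ten digits satisfy {1,10}, while the anchored version rejects it.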
These are the basics of Regex. I’ll be writing a more advanced guide later, but this should be enough to get you rolling for now. Regex is honestly something that you just have to work with over and over again to get the hang of. Given time, something as incomprehensible as the advanced rule at the start of this guide will be as easy for you to read as the rest of this page.