Basant Kukreja

Friday Aug 15, 2008

Using mod_sed to filter web content in Apache

mod_sed is an Apache module that filters web content using powerful sed commands, whether that content is generated by PHP, JSP, or plain HTML. Basic configuration information can be found in the README. In this blog, I will cover how cryptic but powerful sed commands can be used inside Apache.

Using branches "b" to implement if/else type of code
Suppose I want to write:
if (line contains "a") then
   replace "x" with "y"
else
   replace "y" with "x"
fi
If I want to write the above logic using "goto" syntax, I can write something like this (pseudo code):
if (line contains "a") go to :ifpart
# else part
   replace "y" with "x"
   go to :end
:ifpart
   replace "x" with "y"
:end
In sed we can use the branch command "b", which is the equivalent of goto. Here is the equivalent sed code:
/a/ b ifpart
s/y/x/g
b end
:ifpart
s/x/y/g
:end

$ cat one.txt
ax
xyz
$ /usr/ucb/sed -f one.sed < one.txt
ay
xxz
We can write the same example in Apache:
OutputSed "/a/ b ifpart"
OutputSed "s/y/x/g"
OutputSed "b end"
OutputSed ":ifpart"
OutputSed "s/x/y/g"
OutputSed ":end"
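
The branch script can also be tried with whatever sed is on the PATH before it goes into httpd.conf. The sketch below is an illustration, not part of mod_sed: run_branch_demo is a hypothetical helper name, and each -e expression is meant to correspond to one OutputSed directive above.

```shell
#!/bin/sh
# Sketch: run the if/else branch script on the article's sample input.
# run_branch_demo is a hypothetical helper name, not part of sed or mod_sed.
run_branch_demo() {
    # One -e expression per line of one.sed (and per OutputSed directive).
    printf 'ax\nxyz\n' | sed \
        -e '/a/ b ifpart' \
        -e 's/y/x/g' \
        -e 'b end' \
        -e ':ifpart' \
        -e 's/x/y/g' \
        -e ':end'
}

# "ax" contains "a", so x becomes y; "xyz" does not, so y becomes x.
run_branch_demo
```

The output should match the one.txt run shown earlier: ay, then xxz.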


Using hold buffer "h" as a buffer to save current text
Let's say I have the text:
It is Sunday today.
And I want to replace it with two lines:
It is Monday today.
It is Sunday today.
So I want to do the following (pseudo code):
saveline = curline
replace Sunday with Monday in curline
curline = curline + saveline
print curline
In sed, we will write something like:
# Save the current line (pattern space) in the hold buffer.
h
s/Sunday/Monday/
# Append the hold buffer to current text.
G
Sed's G command appends the hold buffer to the current line (pattern space), separated by a newline. Inside Apache, we can do the same thing using OutputSed directives:
OutputSed "h"
OutputSed "s/Sunday/Monday/"
OutputSed "G"
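
The h/G sequence can be checked from a shell as follows. This is a sketch with hypothetical helper names (run_hold_demo, run_hold_demo_guarded). Because the three commands carry no address, they run on every line, so a line without "Sunday" is also printed twice; the second helper shows one possible way to limit the commands to matching lines with a /Sunday/ address.

```shell
#!/bin/sh
# Sketch: the h / s / G sequence from the article, run on sample input.
# Both function names are hypothetical helpers for this illustration.
run_hold_demo() {
    # h copies the line to the hold buffer, s edits the line,
    # G appends the saved original after a newline.
    sed -e 'h' -e 's/Sunday/Monday/' -e 'G'
}

# Variant: apply the same commands only to lines that contain "Sunday".
run_hold_demo_guarded() {
    sed -e '/Sunday/{' -e 'h' -e 's/Sunday/Monday/' -e 'G' -e '}'
}

printf 'It is Sunday today.\n' | run_hold_demo
printf 'It is Sunday today.\nHello\n' | run_hold_demo_guarded
```

With the unguarded version, an input line such as "Hello" would come out as two identical lines, which is rarely what a content filter wants.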


Multiline expression using hold buffer and commands "N", "x", "h" and "H"
Sed is very powerful at handling multi-line text manipulation. Suppose I have a condition that says:
'If a line contains "Sunday" and the next line contains "Monday", then replace "Sunday" with "Monday" in the first line and "Monday" with "Tuesday" in the second line.'
As an example, I have the text:
It is Sunday today.
Tomorrow will be Monday.
The output should look like :
It is Monday today.
Tomorrow will be Tuesday.
So I want to do the following (pseudo code):
search for Sunday in current line
if found then
    saveline = curline
    read next line into curline
    search for Monday in curline
    if found then
        swap curline and saveline
        replace Sunday with Monday in curline
        swap curline and saveline again
        replace Monday with Tuesday in curline
        saveline = saveline + curline
        curline = saveline
    end innerif
end outerif
The next line can be read with the "N" command.
Swapping curline and saveline is provided by the "x" command.
Appending curline to saveline is provided by the "H" command.
Replacing curline with saveline is provided by the "g" command.
The overall sed script looks like this:
/Sunday/ {
# Save the current line (1st line) in the hold buffer.
    h
# Empty the current line.
    s/.*//
# Append the next line to the current line, after a newline.
    N
# Delete the leading newline character left by N.
    s/^.//
# Search for Monday in the next line.
    /Monday/ {
# Exchange the hold buffer and the current line.
        x
# The current line now contains the 1st line, so replace Sunday with Monday.
        s/Sunday/Monday/
# Exchange the hold buffer and the current line again.
        x
# The current line now contains the 2nd line, so replace Monday with Tuesday.
        s/Monday/Tuesday/
# Append the current line (2nd line) to the hold buffer (1st line).
        H
# Replace the current line with the hold buffer (both lines).
        g
    }
}
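
The following sketch runs the multi-line script from a shell so its behaviour can be observed on different inputs. The helper names and the second script are assumptions for illustration only. Tracing the script shows one thing worth knowing: when a "Sunday" line is followed by a line without "Monday", the first line stays in the hold buffer and is never printed. The second helper is an alternative that does not use the hold buffer at all (it keeps both lines in the pattern space after N), shown only as one possible way to keep the first line in that case.

```shell
#!/bin/sh
# Sketch: the article's hold-buffer script, plus a hypothetical alternative.
# All variable and function names here are illustrative assumptions.

# The script from the article, without comments.
multiline_script='/Sunday/ {
    h
    s/.*//
    N
    s/^.//
    /Monday/ {
        x
        s/Sunday/Monday/
        x
        s/Monday/Tuesday/
        H
        g
    }
}'

run_multiline_demo() {
    sed -e "$multiline_script"
}

# Alternative (not the article's method): keep both lines in the pattern
# space and match "Monday" only after the embedded newline.
keep_first_script='/Sunday/ {
    N
    /\n.*Monday/ {
        s/Sunday/Monday/
        s/\(\n.*\)Monday/\1Tuesday/
    }
}'

run_multiline_keep_first() {
    sed -e "$keep_first_script"
}

# Both lines match: Sunday becomes Monday, Monday becomes Tuesday.
printf 'It is Sunday today.\nTomorrow will be Monday.\n' | run_multiline_demo

# Second line lacks "Monday": the article's script prints only line 2.
printf 'It is Sunday today.\nTomorrow will be sunny.\n' | run_multiline_demo

# The alternative prints both lines unchanged in that case.
printf 'It is Sunday today.\nTomorrow will be sunny.\n' | run_multiline_keep_first
```

Whether mod_sed behaves identically to a command-line sed on every edge case (for example, "Sunday" on the last line of the response) is not something this sketch verifies.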
Inside Apache's httpd.conf, I write the equivalent sed script as follows:
OutputSed "/Sunday/ {"
OutputSed "h"
OutputSed "s/.*//"
OutputSed "N"
OutputSed "s/^.//"
OutputSed     "/Monday/ {"
OutputSed         "x"
OutputSed         "s/Sunday/Monday/"
OutputSed         "x"
OutputSed         "s/Monday/Tuesday/"
OutputSed         "H"
OutputSed         "g"
OutputSed     "}"
OutputSed "}"
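
Turning a commented sed script into OutputSed directives by hand, as done above, is mechanical: drop the comment lines and wrap each remaining line in OutputSed "...". A small helper can do this. The sketch below is an assumption-laden convenience, not a tool shipped with mod_sed: sed_to_outputsed is a hypothetical name, and it makes no attempt to escape double quotes or backslashes, so scripts that contain them need manual review.

```shell
#!/bin/sh
# Sketch: convert a sed script on stdin into OutputSed directives on stdout.
# sed_to_outputsed is a hypothetical helper, not part of Apache or mod_sed.
# Limitation: double quotes and backslashes in the script are NOT escaped.
sed_to_outputsed() {
    sed -e '/^[[:space:]]*#/d' \
        -e '/^[[:space:]]*$/d' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e 's/.*/OutputSed "&"/'
}
# The expressions above, in order: delete comment lines, delete blank lines,
# strip leading whitespace, strip trailing whitespace, wrap in OutputSed "...".

# Example: the hold-buffer script from earlier in the article.
sed_to_outputsed <<'EOF'
# hold the buffer
h
s/Sunday/Monday/
# Append the hold buffer to current text.
G
EOF
```

For the hold-buffer script this emits the same three OutputSed lines shown in that section.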
The above examples show how powerful sed commands can be used to filter web content (whether it is generated by HTML, PHP, or JSP). Details of sed can be found in the sed man page.
Comments:

Great, but how the heck do I get mod_sed? I haven't been able to find it for download anywhere, including your blog.

Posted by Benjamin Weiss on February 03, 2009 at 10:41 AM PST #

mod_sed is part of apache trunk.

If you are not using apache from trunk then
you can compile it for apache 2.2. It perfectly works with apache 2.2. Checkout the trunk :

Take the following 4 files from trunk (modules/filters directory) :
mod_sed.c sed0.c sed1.c regexp.c

And compile for apache 2.2 using apxs
apxs -c mod_sed.c sed0.c sed1.c regexp.c

Posted by Basant Kukreja on February 03, 2009 at 10:59 AM PST #

Here is the url for these files :
http://svn.apache.org/repos/asf/httpd/httpd/trunk/modules/filters/

Posted by Basant Kukreja on February 03, 2009 at 11:01 AM PST #

I just got a few problems with that installation of mod_sed on my CPanel Server with Apache 2.2

apxs -c mod_sed.c sed0.c sed1.c regexp.c always gives me an error that the files are not found, so where do I have to copy the 4 files?

Hope you could help with that problem.

Posted by mike on February 28, 2009 at 01:11 AM PST #

I have already written above that these are part of the httpd trunk.
You can get these files from :
http://svn.apache.org/repos/asf/httpd/httpd/trunk/modules/filters/

Posted by Basant Kukreja on February 28, 2009 at 05:24 PM PST #

I did the following, I installed the RPM from http://www.atomicorp.com/channels/atomic/centos/5/x86_64/RPMS/ and then tried to work with the apxs, but it always returned me that the files are not found.

I am relatively new to Linux, so it would be great if you could explain it in a bit more detail for me.

Posted by mike on March 01, 2009 at 02:47 AM PST #

All right, I solved the problem after recompiling my apache once more and start from the beginning.

Now I have one final question. I tried to use regular expressions, but they are not accepted. So I used the following command with mod_substitute, where it worked:

Substitute 's|<body?(.*[^>])>|$0MY OWN CODE|g'

So I added my own code after any type of <body>, but with mod_sed nothing happens, have I missed something?

Posted by mike on March 01, 2009 at 06:55 AM PST #

Found one more question.

I just wanted to use the sed filter for all file types. So in mod_substitute the following works:

AddOutputFilterByType SUBSTITUTE text/html

So that command works for html and php and so on, but mod_sed is not accepting that. So for mod_sed, every file ending has to be added like that ?

AddOutputFilter Sed php php4 php5 html .......

Is there a way to use it for all processed files?

Posted by mike on March 01, 2009 at 09:07 AM PST #

AddOutputFilterByType Sed text/html

Posted by try this on March 02, 2009 at 06:02 AM PST #

Hi,

mod_sed seems to be exactly what I need, but I'm having a hard time getting it set up. Specifically, when I try to compile, I get a whole screenful of errors that looks like:

apxs -c mod_sed.c regexp.c sed0.c sed1.c
/usr/lib/apr-1/build/libtool --silent --mode=compile gcc -prefer-pic -O2 -g -march=i386 -mcpu=i686 -DLINUX=2 -D_REENTRANT -D_GNU_SOURCE -D_LARGEFILE64_SOURCE -pthread -I/usr/include/httpd -I/usr/include/apr-1 -I/usr/include/apr-1 -I/usr/include/mysql -c -o mod_sed.lo mod_sed.c && touch mod_sed.slo
`-mcpu=' is deprecated. Use `-mtune=' or '-march=' instead.
mod_sed.c:1: error: syntax error before '<' token
mod_sed.c:19:29: warning: character constant too long for its type
mod_sed.c:20:27: warning: character constant too long for its type
mod_sed.c:21:36: warning: character constant too long for its type
mod_sed.c:22:27: warning: character constant too long for its type
mod_sed.c:29: error: stray '#' in program
...
mod_sed.c:235: error: syntax error before '<' token
apxs:Error: Command failed with rc=65536

Any idea why this is happening? I'm running httpd 2.2.8 on Fedora Core 4. Thanks for your help!

-Dan Delany

Posted by Dan Delany on March 03, 2009 at 01:58 PM PST #

I've solved the problem I mentioned above, here's what was happening in case anyone else runs into it... I was getting my code from http://src.opensolaris.org/source/xref/webstack/mod_sed/ with wget (eg. wget http://src.opensolaris.org/source/xref/webstack/mod_sed/sed0.c). This does NOT work, as this is not actually a C file, but a generated HTML file...

Once I grabbed the files from http://svn.apache.org/repos/asf/httpd/httpd/trunk/modules/filters/ I was able to compile correctly. One caveat to the instructions above: In addition to the source files mentioned (mod_sed.c sed0.c sed1.c regexp.c), I also needed the header files to compile (libsed.h sed.h regexp.h). Once I got those files, the .so file was correctly created in the .libs directory. Thanks for a great module!

-Dan

Posted by Dan Delany on March 03, 2009 at 02:16 PM PST #

Very useful....
I am having some trouble replacing characters in an input filter by their HEX code , can anyone provide the syntax please?

G.

Posted by G. on April 04, 2009 at 07:35 AM PDT #

mod_sed is now integrated into opensolaris. Users can download mod_sed from :
http://src.opensolaris.org/source/xref/sfw/usr/src/cmd/apache2/modules/mod_sed.tar.gz

Posted by Basant on May 27, 2009 at 04:29 PM PDT #

You can also checkout the mod_sed from the bitbucket repository :
hg clone http://bitbucket.org/basantk/mod_sed/

Posted by Basant Kukreja on September 16, 2009 at 03:33 PM PDT #

Fixed mod_sed bug 48024
https://issues.apache.org/bugzilla/show_bug.cgi?id=48024#c3

(Also available from bitbucket repository)

Posted by Basant Kukreja on October 19, 2009 at 01:59 PM PDT #
