Need to improve urlencode function

I need a way to URL encode strings with a shell script on an OpenWrt device running an old version of BusyBox.
Right now I have ended up with the following code:

urlencode() {
  echo "$@" | awk -v ORS="" '{ gsub(/./,"&\n") ; print }' | while read l
  do
    c="`echo "$l" | grep '[^-._~0-9a-zA-Z]'`"
    if [ "$l" == "" ]
    then
      echo -n "%20"
    else
      if [ -z "$c" ]
      then
        echo -n "$l"
      else
        printf %%%02X \'"$c"
      fi
    fi
  done
  echo ""
}

This works more or less fine, but there are a few flaws:

  1. Some characters are skipped, like “”, for example.
  2. The result is returned character by character, so it’s extremely slow: it takes about 20 seconds to URL encode just a few strings in a batch.

I’m not very well versed in shell scripting in obscure shells (ash here), so I appeal to the collective wisdom to help me improve it so that it works faster and doesn’t skip any characters.
I would appreciate any advice, but please don’t offer substitution algorithms, as I’m looking to support all characters, not just a few special ones. Also, my version of bash doesn’t support substring expansion like ${var:x:y}.

Thanks!

Serg

[TL;DR: use the urlencode_grouped_case version in the last code block.]

Awk can do most of the job, except that it annoyingly lacks a way to convert a character to its numeric code. If od is present on your device, you can use it to convert all characters (more precisely, bytes) into the corresponding numbers (written in decimal, so that awk can read them), then use awk to print valid characters literally and quote the others into %XX form.

urlencode_od_awk () {
  echo "$1" | od -t d1 | awk '{
      for (i = 2; i <= NF; i++) {
        printf(($i>=48 && $i<=57) || ($i>=65 && $i<=90) || ($i>=97 && $i<=122) ||
               $i==45 || $i==46 || $i==95 || $i==126 ?
               "%c" : "%%%02x", $i)
      }
    }'
}
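To see what awk receives, here is a sketch of od’s decimal dump for a short string; the exact column spacing varies between od implementations, but the awk loop only reads fields 2 onward, skipping the offset column:

```shell
printf 'a b' | od -t d1
# typical output (first column is the byte offset; spacing varies):
#   0000000   97   32   98
#   0000003
```

Note that echo "$1" appends a newline, so this version also emits a trailing %0a; trim it, or feed the string with printf %s "$1" instead of echo, if that is not wanted.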

If your device doesn’t have od, you can do everything inside the shell; this will significantly help performance (fewer calls to external programs, or none at all if printf is a builtin) and be easier to write correctly. I believe all BusyBox shells support the ${VAR#PREFIX} construct to trim a prefix from a string; use it to strip the first character of the string repeatedly.

urlencode_many_printf () {
  string=$1
  while [ -n "$string" ]; do
    tail=${string#?}
    head=${string%$tail}
    case $head in
      [-._~0-9A-Za-z]) printf %c "$head";;
      *) printf %%%02x "'$head"
    esac
    string=$tail
  done
  echo
}
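The character-peeling trick in that loop can be seen in isolation; this is a minimal sketch using the same two parameter expansions:

```shell
# Split off the first character with pure parameter expansion:
string='%ab'
tail=${string#?}        # strip one character from the front: 'ab'
head=${string%"$tail"}  # strip that tail from the back, leaving: '%'
printf '%s|%s\n' "$head" "$tail"
# prints: %|ab
```

Quoting "$tail" on the right of % makes the remainder a literal string rather than a pattern, which guards against glob characters in the input; the version above relies on the pattern interpretation being harmless here.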

If printf is not a builtin but an external utility, you will again gain performance by invoking it only once for the whole function instead of once per character. Build up the format and parameters, then make a single call to printf.

urlencode_single_printf () {
  string=$1; format=; set --
  while [ -n "$string" ]; do
    tail=${string#?}
    head=${string%$tail}
    case $head in
      [-._~0-9A-Za-z]) format=$format%c; set -- "$@" "$head";;
      *) format=$format%%%02x; set -- "$@" "'$head";;
    esac
    string=$tail
  done
  printf "$format\n" "$@"
}
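The trick that makes this work is that printf accepts an argument of the form 'X (a quote followed by a character) wherever it expects a number, and uses the character’s code as the value. A hand-built example of the accumulated format and argument list for the input "a b":

```shell
# The loop would build format '%c%%%02x%c' and the arguments
# a, "' " (quote + space), b; a single printf renders it all:
printf '%c%%%02x%c\n' a "' " b
# prints: a%20b
```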

This is optimal in terms of external calls (there’s a single one, and you can’t do it with pure shell constructs unless you’re willing to enumerate all characters that need to be escaped). If most of the characters in the argument are to be passed unchanged, you can process them in a batch.

urlencode_grouped_literals () {
  string=$1; format=; set --
  while
    literal=${string%%[!-._~0-9A-Za-z]*}
    if [ -n "$literal" ]; then
      format=$format%s
      set -- "$@" "$literal"
      string=${string#$literal}
    fi
    [ -n "$string" ]
  do
    tail=${string#?}
    head=${string%$tail}
    format=$format%%%02x
    set -- "$@" "'$head"
    string=$tail
  done
  printf "$format\n" "$@"
}
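The key expansion is ${string%%[!-._~0-9A-Za-z]*}, which removes the longest suffix starting at the first character that needs quoting, leaving the leading run of safe characters. A quick sketch:

```shell
string='foo/bar baz'
literal=${string%%[!-._~0-9A-Za-z]*}  # drop everything from the first unsafe char on
echo "$literal"
# prints: foo
echo "${string#"$literal"}"           # the rest still awaits encoding
# prints: /bar baz
```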

Depending on compilation options, [ (a.k.a. test) may be an external utility. We’re only using it for string matching, which can also be done within the shell with the case construct. Here are the last two approaches rewritten to avoid [, first going character by character:

urlencode_single_fork () {
  string=$1; format=; set --
  while case "$string" in "") false;; esac do
    tail=${string#?}
    head=${string%$tail}
    case $head in
      [-._~0-9A-Za-z]) format=$format%c; set -- "$@" "$head";;
      *) format=$format%%%02x; set -- "$@" "'$head";;
    esac
    string=$tail
  done
  printf "$format\n" "$@"
}
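The while condition is now a case statement: it runs the explicit false only when the string is empty, and succeeds (status 0, since no pattern matched) otherwise. A minimal sketch of the same looping idiom on its own:

```shell
s=abc; n=0
while
  case "$s" in "") false;; esac
do
  s=${s#?}       # drop one character per iteration
  n=$((n+1))
done
echo "$n"
# prints: 3
```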

and copying each literal segment in a batch:

urlencode_grouped_case () {
  string=$1; format=; set --
  while
    literal=${string%%[!-._~0-9A-Za-z]*}
    case "$literal" in
      ?*)
        format=$format%s
        set -- "$@" "$literal"
        string=${string#$literal};;
    esac
    case "$string" in
      "") false;;
    esac
  do
    tail=${string#?}
    head=${string%$tail}
    format=$format%%%02x
    set -- "$@" "'$head"
    string=$tail
  done
  printf "$format\n" "$@"
}
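As a quick sanity check of the recommended version, here it is again verbatim so the snippet runs standalone, with a couple of sample inputs (note that %02x emits lowercase hex digits):

```shell
urlencode_grouped_case () {
  string=$1; format=; set --
  while
    literal=${string%%[!-._~0-9A-Za-z]*}
    case "$literal" in
      ?*)
        format=$format%s
        set -- "$@" "$literal"
        string=${string#$literal};;
    esac
    case "$string" in
      "") false;;
    esac
  do
    tail=${string#?}
    head=${string%$tail}
    format=$format%%%02x
    set -- "$@" "'$head"
    string=$tail
  done
  printf "$format\n" "$@"
}

urlencode_grouped_case 'hello world'
# prints: hello%20world
urlencode_grouped_case 'a/b?c=d'
# prints: a%2fb%3fc%3dd
```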

I tested on my router (MIPS processor, DD-WRT-based distribution, BusyBox with ash, external printf and [). Each version is a noticeable speed improvement over the previous one. Moving to a single fork is the most significant improvement; it’s the one that makes the function respond almost instantly (in human terms) as opposed to after a few seconds for a realistically long URL parameter.
