June 26, 2024, 04:10:32 AM

News:

Own IWBasic 2.x ? -----> Get your free upgrade to 3.x now.........


Searching (and replacing) in binary files

Started by Peter, March 24, 2009, 03:12:44 PM

Previous topic - Next topic

0 Members and 1 Guest are viewing this topic.

Peter

Excuse me if there are samples on how to do this, but I simply can't find them even though I've searched through the forums several times now.

I'm having a complete blackout and can't figure out how to search a huge binary file (between certain offsets, but that's not the issue) for certain bytes effectively.
I don't want to loop through the files one byte at a time since it would take way too much time to scan a huge file, and the strings wouldn't do it right since I might be stumbling onto nulls every now and then.
Since I'm looking for binary strings, the strings I am searching for might as well be containing nulls and that would be another issue. Could anyone give me some guidelines or a sample? I'm completeley stuck.  :-[

Thanks.

Ionic Wind Support Team

Peter,
The only efficient way of doing this is to load the file into memory and search 32 bits at a time.  (4 bytes = 1 UINT).  There are no shortcuts unless the binary file has a known format.

Paul.
Ionic Wind Support Team

Peter

That means that I will load a couple of kilobytes at a time, search for my sequence (for example 8 bytes) and then take the next couple of kilobytes (-7 bytes to not miss anything) and so on?

Typical scenario:
Open file
Read 5kb (5*1024) into an ISTRING (or memory allocated with allocmem?)
loop through searching for the beginning of my sequence, in this case 25 bytes, (using ISTRING[position] or for example #<CHAR>pointer[position] if using allocmem)
if found, search for the rest of the string
if not string is found Seek to start of next 5kb chunk minus 24 bytes (in case my sequence happens to begin in the end of the chunk)
read, and repeat all over again

Something like that?
Sorry, but I'm really not used to this since my previously used basic-compilers normally didn't use null-terminated strings.

Ionic Wind Support Team

Peter,
Without more information I can't guide you.

You said it was a binary file, not an ASCII one, what is the format of the file?  What program generated it?

Why would you only want to load 5K at a time?  Load the entire file into memory and search for the bytes you are looking for one UINT at a time.  Use NEW and DELETE to allocate memory for the search.

Paul.
Ionic Wind Support Team

Peter

Sorry for the lack of information. I ment a general function for binary files containing random data of any size. In this particular case, I will be scanning through an executable as a test, but I want it to support files up to sizes of dvd-isos if needed. The string I'll be searching for will also be of varying size, but probably never larger than a few kilobytes.

Ionic Wind Support Team

The only way to scan it will be one byte at a time then.  But you can make it more efficient by reading 4 bytes at once, which I guess wasn't clear when I first mentioned it.

You have a pointer to memory allocated by NEW which contains the file, or a portion of it.  The least efficient way is to do a byte by byte comparison to your search object.  I'll call it an object because it doesn't matter if it is a string, or sequence of bytes from an image, etc.

A byte by byte search usually uses two loops, one to find the starting byte, and another to compare remaining bytes

"length1" is the length of the memory to be searched, "length2" is the length of the item you are searching for.
pMem is the memory pointer containing the loaded file, and pTarget is the item you are searching for:


nTemp = length2
ptemp = pMem
while length1
s1 = pTemp
s2 = pTarget
        while (#<char>s1 = #<char>s2) AND nTemp
              s1++:s2++
              nTemp--
        endwhile
        if nTemp = 0
            'target found at location pTemp
        else
            nTemp = length2: 'reset for next test
        endif
  length1--
  pTemp++
endwhile


Obviously the code is just for illustration.  It is how most INSTR commands are coded in BASIC languages. The inefficient part comes from having to access 1 byte at a time in the middle of dword boundaries, which causes the processor to generate a memory exception in order to read the byte.  If you change it to read one UINT at a time, and search that UINT for the starting byte using an AND mask you can speed up the search significantly.  You would still need to do the byte by byte comparison once you have a starting point.

To create an AND mask take the first byte of the item you are searching for and put it into all 4 bytes of a UINT. 


byte = #<char>ptarget
UINT mask = (byte << 24) | (byte << 16) | (byte << 8) | byte

nTemp = length1
pTemp = pMem
while length1 > 4
    if (#<UINT>pTemp & mask) <> 0
        'pTemp contains a valid starting point
         'use the code above to further search byte by byte using pTemp as the starting point and the current value of length1
          'but only search 4 times in the outer loop.
    endif
    pTemp += 4
    length1 -= 4
endwhile   


What this does is first search the source 4 bytes at a time for a beginning byte that matches the first byte of your search item.  Which keeps memory reads initially aligned on even dword boundaries and can save up to 12 clock cycles per loop.  Only beneficial if you are talking about multiple megabytes worth of data you are searching.  If the file size is less then say 64K then you probably could get away with just an INSTR type search.

I normally write such search routines in assembler, so the above might be a little vague.  Experiment.

Paul.
Ionic Wind Support Team

Peter

Great answer, thank you. I had no idea I could use INSTR() on a string that could contain NULL's. Very helpful indeed. Always to the rescue.