An interesting little difference in file handling ...


8 posts
by kaspencer » Mon Feb 04, 2013 11:28 am
Greetings all ...

I have encountered an interesting difference in the way files are handled between using Windows as a client PC with either a Windows Server 2003 R2 as server or a Raspberry Pi with Raspbian Wheezy and Samba as a server. In both cases the client sees the data via mapped drives.

First, by way of background: I have several business and community websites hosted in the Linx centre (Canary Wharf). Although the web statistics for the sites are available to me via Webalizer, I actually FTP the webstat log files down to my server daily, for more detailed analysis. Recently, I transferred responsibility for the management and processing of the daily files to one of my Raspberry Pis, which functions as a domain controller and fileserver.

The files are FTP'd to the server, and are then renamed to the date of retrieval, in the form YYYYMMDD.txt. The Win2k3 server did this with a small compiled C program, whereas the Raspberry Pi does it with a script. But the effect on the filename is absolutely identical in each case. All files have the same time, of 06:00am exactly.

Each month, I concatenate the 30 or 31 files into a single file. This is done at a Windows Command Window by navigating to the directory on the mapped drive and typing the following command, for example, for January 2013:
Code:
type 201301??.txt >> 201301.txt

So here is the "interesting little difference" which I have noted:
When this process was managed at a mapped drive on the Windows Server 2003 R2 server, the effect was to take the files in exact date order and write the contents into the new file for the month.
When the process is managed at a mapped drive on the Raspberry Pi, the files are NOT written into the new monthly file in exact date order: rather there is always one file which is written first, usually it is that from the 8th, 9th or 10th of the month. That file is then followed by the other files in the correct date order.

Does anyone have an explanation? Remember that the files are all named by the date, one file per day for the month, and that they all have the same time, exactly.

(Note that because of this little issue, the files are now concatenated into the monthly file at the time of retrieval, so they are put into the monthly file in the correct order!)

All the best,

Kenneth Spencer
2x256Mb + 2x512Mb RPi, Eth'netLAN+Win2k3R2 svr, 40MbpsFTTC.
RaspBMC: 128G SD, K400 wl Kb+TPd, HD32tv&Rem.
RW'y webserver: 64G SD, Y-RK49 wl Kb+M, HannsG W24" screen
RW'y PDC & fileserver: as above + 2TB disc.
+RiscOSPi on 32G uSD.
Posts: 71
Joined: Wed Mar 07, 2012 11:37 pm
Location: UK, England, Wiltshire
by joan » Mon Feb 04, 2013 12:38 pm
Different collation order by your Windows versions.
Posts: 5482
Joined: Thu Jul 05, 2012 5:09 pm
Location: UK
by kaspencer » Mon Feb 04, 2013 6:02 pm
OK, Joan ... let us consider it further:

1. The same client PC was used in both instances, so it cannot be " ... collation order by [...] Windows versions", in the client, as there was only one Windows client involved;
2. Only one Windows Server version was in use, so if it is collation order, it is collation order not between Windows versions, but between Windows Server 2k3 and Raspbian Wheezy/Samba.

So if we modify your statement a little, so that it reads "Different collation order, in serving the files, between Windows 2k3 Server and Raspbian Wheezy+Samba" we must examine why the filenames might collate differently. The filenames are:
20130101.txt
...
20130109.txt
20130110.txt
20130111.txt
...
20130119.txt
20130120.txt
20130121.txt
...
20130129.txt
20130130.txt
20130131.txt

All files are timed at 06:00am on the appropriate day and creation, modification, and access dates are the same as the filenames imply. There are no alphabetic characters in the filenames that might affect the collation order.

Why on earth would 20130108.txt collate first in the list, followed by 20130101.txt - 20130131.txt in the expected numerical order? What possible collation method could produce that effect?

Educate me eh!

I am looking forward to your reply with considerable interest.

Ken.
by terrycarlin » Mon Feb 04, 2013 9:59 pm
Your method of retrieval does not guarantee an order. The files would be retrieved in the order that they appear in the directory structure, not in date or name order. Files are not automatically put in the directory in any order.

To force a specific order to the files you can use a change to your script.
Note that this uses the "backquote" character, which means: run the enclosed command and substitute its output in place.

Code:
cat `ls 201301??.txt` >> 201301.txt


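To order the files by time rather than by name, you can use this one.

Code:
cat `ls -tr 201301??.txt` >> 201301.txt


The -tr arguments to the ls command (check out the man page for "ls") sort the listing by modification time, oldest first. Be careful with -U: in GNU ls, as shipped with Raspbian, -U means "do not sort" and lists the entries in raw directory order, which is exactly the order you are trying to avoid.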

This sorted list is used as an argument list to the cat command.

There are about a million other ways to do what you need.
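
One of those other ways, as a minimal sketch: it assumes Python is available on whichever machine does the concatenation, and the function name `concat_month` is made up for illustration. It sorts the daily files by name itself instead of relying on the order in which the directory (or Samba) happens to return them.

```python
from pathlib import Path


def concat_month(directory, prefix):
    """Join <prefix>DD.txt files into <prefix>.txt, in filename order."""
    directory = Path(directory)
    out_path = directory / f"{prefix}.txt"
    # glob() returns entries in whatever order the filesystem enumerates them,
    # so sort by name explicitly; YYYYMMDD names sort chronologically.
    daily = sorted(directory.glob(f"{prefix}??.txt"), key=lambda p: p.name)
    # "wb" truncates any existing monthly file (like ">" rather than ">>").
    with open(out_path, "wb") as out:
        for path in daily:
            out.write(path.read_bytes())
    return [path.name for path in daily]
```

Called as, say, `concat_month("/srv/webstats", "201301")` (a hypothetical path), it returns the list of daily filenames in the order they were written, which makes the ordering easy to check.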
If it ain't broke, take it apart and see how it works.
Posts: 70
Joined: Thu Jun 14, 2012 10:42 pm
by kaspencer » Tue Feb 05, 2013 12:25 am
Thanks Terry - very constructive ...

There are still a couple of unanswered points in this discussion:

1. The files are definitely placed in the system in date order: each file is created on the date which it bears in its filename. Ordinarily this order, being that of creation, modification and access (i.e. they are all the same), is the order in which either filing system (Linux or Windows) will list the files. As far as I know this is common to Linux (although the 2TByte disc is formatted in NTFS and mounted under NTFS-3G) and to Windows.

2. The command is actually executed on the Windows client - and it was thus when the server ran on Windows Server 2k3 R2, and remains so now under Raspbian Wheezy with Samba. Thus, in the Windows Command Window the command is actually
Code:
TYPE 201301??.txt >> 201301.txt
rather than the Linux command, "cat".

I do not know whether it is possible to make Windows pass the equivalent of your "ls" sorting argument to the server from a Windows command line, as I know of no argument to the TYPE command that would have that effect. Furthermore, when the Windows command "DIR" is passed from the client to the server, Linux returns the files in the expected order (i.e. that of creation, date or filename, which should all result in the same order).

I would be most interested to see any further comments and views. I should also say that the investigation is somewhat academic now because the concatenation is done immediately after the FTP script on the day of file retrieval, so that the correct order is guaranteed. Nevertheless it remains an interesting little issue that would be nice to resolve!

Many thanks

Ken
by terrycarlin » Tue Feb 05, 2013 5:59 pm
Found some info that may help with understanding this.
Keep in mind when reading this article that when files are created in Unix/Linux directories, the file names are stored in the first available directory entry.

http://www.linuxquestions.org/questions/linux-general-1/samba-file-list-ordering-748967/

For extra credit, try googling Unix / Linux inode
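
To see the effect for yourself, here is a small sketch (assuming Python on the Pi; the helper name `listing_orders` is hypothetical). It returns the names in the raw order the operating system enumerates them, next to the same names sorted, so you can spot any file sitting in an "early" directory slot.

```python
import os


def listing_orders(directory, suffix=".txt"):
    """Return (raw, by_name): names as enumerated by the OS, and sorted by name."""
    with os.scandir(directory) as entries:
        # scandir() yields entries in directory order; no sorting is applied.
        raw = [e.name for e in entries if e.is_file() and e.name.endswith(suffix)]
    return raw, sorted(raw)
```

If `raw` and `by_name` differ, anything that walks the directory without sorting (a wildcard expanded by a client over Samba, for instance) may see the files in the `raw` order. Whether the two differ depends on the filesystem and its history, so identical lists on one run prove nothing about the next.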
by Jim Manley » Wed Feb 06, 2013 5:09 am
What people not familiar with low-level OS technology are not aware of is that there is a whole lotta stuff going on under the covers that's not intuitively obvious to the most casual of observers. One of the things that's going on is creation and deletion of a surprising number of files and parts of files (as sectors) on a 24 x 7 basis. This includes various log files, temp files, cached files if a web and/or proxy server is running, movement of sectors during automatic defragmentation, and tons of other bits and pieces. If virtual memory is handled within a file rather than its own partition, there may be at least an order of magnitude more sectors being read and written from/to user-space partition(s) on storage device(s).

Things can get much more complicated if various extensions and enhancements are implemented in the filesystem portion of the OS. Media servers typically have radically different sector reading and writing schemes compared to typical filesystems, where sectors are written at fixed intervals across the storage device(s) as media can tend to be of predetermined durations (especially broadcast radio and video that is produced in chunks made up of 10, 15, or 30-minute multiples). This is done to maximize throughput at a guaranteed minimum data transfer rate. Such media is never written in contiguous sectors of more than a handful at a time to account for disk rotation rate, disk cache RAM size, and a bunch of other factors. OS process-related files are also interleaved at predetermined intervals among the media-related sectors to ensure that there won't be interference between OS and media file reads/writes. Similar techniques are used for real-time systems to guarantee specific levels of file access performance.
The best things in life aren't things ... but, a Pi comes pretty darned close! :D
"Education is not the filling of a pail, but the lighting of a fire." -- W.B. Yeats
In theory, theory & practice are the same - in practice, they aren't!!!
Posts: 1356
Joined: Thu Feb 23, 2012 8:41 pm
Location: SillyCon Valley, California, USA
by Bakul Shah » Wed Feb 06, 2013 9:36 pm
kaspencer wrote:
Each month, I concatenate the 30 or 31 files into a single file. This is done at a Windows Command Window by navigation to the directory on the mapped drive, and typing the following command, for example, for January 2013:
Code:
type 201301??.txt >> 201301.txt

So here is the "interesting little difference" which I have noted:
When this process was managed at mapped drive on the Windows Server 2003 R2 server, the effect was to take the files in exact date order and write the contents into the new file for the month.
When the process is managed at a mapped drive on the Raspberry Pi, the files are NOT written into the new monthly file in exact date order: rather there is always one file which is written first, usually it is that from the 8th, 9th or 10th of the month. That file is then followed by the other files in the correct date order.

Show us the commands you use on both. If you used "cat 201301??.txt > 201301.txt" on Linux, both should behave identically. Note I used ">" and not ">>" (append). With append, if the file already exists and is not empty, it is not truncated to zero length, so any old content may confuse you. Try "wc 201301??.txt". The final "total" line should match what you get when you do "wc 201301.txt". There is no magic to this.
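
The same sanity check can be sketched in Python, assuming it is available where the files live; `compare_sizes` is a made-up name, and it compares byte counts only, not the line and word counts that wc also prints.

```python
from pathlib import Path


def compare_sizes(directory, prefix):
    """Return (daily_total, monthly_size) in bytes for <prefix>DD.txt vs <prefix>.txt."""
    directory = Path(directory)
    # The ?? in the pattern requires two extra characters, so the monthly
    # file <prefix>.txt is never counted among the daily files.
    daily_total = sum(p.stat().st_size for p in directory.glob(f"{prefix}??.txt"))
    monthly_size = (directory / f"{prefix}.txt").stat().st_size
    return daily_total, monthly_size
```

If the two numbers differ, the monthly file most likely holds leftovers from an earlier append, or a daily file was missed.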
Posts: 293
Joined: Sun Sep 25, 2011 1:25 am