Opened 8 years ago
Closed 8 years ago
#658 closed defect (fixed)
ASCII reader very broken
| Reported by: | butler | Owned by: | krzywon |
|---|---|---|---|
| Priority: | critical | Milestone: | SasView 4.0.0 |
| Component: | SasView | Keywords: | |
| Cc: | | Work Package: | SasView Bug Fixing |
Description
As reported by Steve King on ticket #614:
Any non-I(Q) data ASCII files that just happen to have space or tab delimited numbers in the opening lines then these get loaded as data (the first pair of numbers as X & Y). So examples of this include BSL/OTOKO 'header' files, eg: this CORFUNC file had the extension .CF1
201 1 1 1 0 0 0 0 0 0
C:\Temp\X2
where it took the 201 & 1 as X & Y, or perhaps a temporary parameter file such as this one which had the extension .txt:
0.200000E+03
0.100000E+01
0.522096E+00 0.211909E-01 0.396440E-02 0.101775E-02 0.307366E-03
where it took the 0.522096E+00 0.211909E-01 as X & Y.
I suspect the ASCII reader is just a little too embracing. Maybe if it checked for two consecutive lines with pairs of numbers, or something like that? But tricky.
This should actually not happen. The whole point of the ASCII reader has ALWAYS been that it should cover 95% of cases by looking for X consecutive rows (where I think X is 6 now) that contain numbers only and have EXACTLY the same number of columns. Not perfect, but as close as one can get.
In fact it appears that the reader is completely broken and reads any row that contains 2 or more numbers (at least if no text is on the line) as valid data … even if the number of columns varies (so will take line 1 with 4 columns and line 2 with 2 and be happy)
As Steve points out this bug seems to have existed as far back as 2.2.1. Looking at the code it clearly has the infrastructure and seems to be set to an X of 5, but somehow the code is seriously flawed or being partially bypassed.
Since this is a pure ASCII reader bug and unrelated to the folder reading issue of #614, I am creating a new ticket. I add here the files attached to that ticket as examples that get misread.
Attachments (5)
Change History (10)
Changed 8 years ago by butler
Changed 8 years ago by butler
Another alteration of corfunc parameter file demonstrating the strange logic
comment:1 Changed 8 years ago by piotr
Fixing the reader is not difficult, given a clear set of expected rules.
- do we want it to parse x,y only for consecutive lines of same token length?
- do we want to drop the current x,y once an invalid line is found (as in AnOnerousExample2)?
- do we want to use x,y only when there are > 1 (consecutive) lines with the same number of tokens?
- any other requirements?
I agree the current implementation of the ASCII reader is pretty inefficient and should be at least partially rewritten, so let's do it properly.
comment:2 Changed 8 years ago by butler
- Owner set to krzywon
- Status changed from new to assigned
comment:3 Changed 8 years ago by krzywon
I have modified the reader locally and removed quite a bit of unused and unneeded code within it, trimming it by 120 lines. Before I push, I wanted to be sure what I am loading is the expected result.
Results for the files attached to this ticket:
- AnOnerousExample.txt: 2 data points with no error values: X = [45.0, 50.0], Y = [2.0, 4.0]
- AnOnerousExample2.txt: 1 data point with dY: X = [56.0], Y = [37.0], dY = [3.0]
- extract.txt: 1 data point with dX and dY: X = [0.522096], Y = [0.0211909], dY = [0.0039644], dX = [0.00101775]
- New Text Document.txt: 1 data point, no error: X = [4.0], Y = [5.0]
- X27000.CF1: 1 data point, no error: X = [201.0], Y = [1.0]
Results for Anton-Paar_PDH.txt in the test/upcoming_formats repo (does not have the .pdh extension, so it is loaded by the ASCII reader):
- 673 data points, dY, no dX
- X_min = 0.06604, X_max = 4.5017
- Y_min = 0.005563176, Y_max = 16.83994
- dY_min = 0.002348514, dY_max = 0.4175145
comment:4 Changed 8 years ago by butler
Actually for all the examples given they should ALL be "not a data file." The rules were: data must have 5 or more consecutive lines (I thought it was 6 but looking at the code briefly it seems it was 5) that contain
1) two or more numbers
2) ONLY numbers
3) EXACTLY THE SAME number of numbers
This last rule is predicated on the fact that any real data will be in a consistent format: whether there are 2, 3, 4, 6 or even 10 columns, all lines will have the same number of them.
The reader rejects everything before the first occurrence of 5 such consecutive lines. It then reads lines consecutively until it hits an EOF or a line which no longer meets the specs above. That line and anything after it would be considered footer information.
The reader was never designed to look for more than one data set in a generic ASCII file and probably shouldn't.
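For reference, the rules above could be sketched roughly as follows. This is only a minimal illustration of the intended behaviour, not the actual SasView reader code: the names `MIN_ROWS`, `parse_numeric_row` and `find_data_block` are hypothetical, and splitting on whitespace only (spaces or tabs) is an assumption.

```python
MIN_ROWS = 5  # minimum run length, per the rules above (hypothetical name)


def parse_numeric_row(line):
    """Return the line as a list of floats if it holds two or more
    whitespace-separated tokens that are ALL numbers, otherwise None."""
    tokens = line.split()
    if len(tokens) < 2:
        return None
    try:
        return [float(token) for token in tokens]
    except ValueError:
        return None


def find_data_block(lines, min_rows=MIN_ROWS):
    """Return the first run of at least `min_rows` consecutive numeric
    rows that all have the same column count, or [] if no run qualifies.

    Everything before the run is treated as header; the first line that
    breaks the run, and everything after it, is treated as footer.
    """
    run = []
    for line in lines:
        row = parse_numeric_row(line)
        if row is not None and (not run or len(row) == len(run[0])):
            run.append(row)
            continue
        if len(run) >= min_rows:
            return run  # data block ended; the rest is footer
        # Run was too short: discard it and possibly start a new one here.
        run = [row] if row is not None else []
    return run if len(run) >= min_rows else []
```

Under these rules, a file such as X27000.CF1 or extract.txt never produces a run of 5 matching lines, so `find_data_block` returns an empty list and the file would be reported as not a data file.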
comment:5 Changed 8 years ago by krzywon
- Resolution set to fixed
- Status changed from assigned to closed
With those criteria, I made a small change this morning and believe it is now working as expected. I pushed my changes so everyone can test.
OTOKO format header file (ASCII)