Opened 3 years ago

Closed 3 years ago

#658 closed defect (fixed)

ASCII reader very broken

Reported by: butler
Owned by: krzywon
Priority: critical
Milestone: SasView 4.0.0
Component: SasView
Keywords:
Cc:
Work Package: SasView Bug Fixing

Description

As reported by Steve King on ticket #614:

Any non-I(Q) ASCII data files that just happen to have space- or tab-delimited numbers in their opening lines get loaded as data (the first pair of numbers as X & Y). Examples of this include BSL/OTOKO 'header' files, e.g. this CORFUNC file, which had the extension .CF1:

201 1 1 1 0 0 0 0 0 0
C:\Temp\X2

where it took the 201 & 1 as X & Y, or perhaps a temporary parameter file such as this one which had the extension .txt:

0.200000E+03
0.100000E+01
0.522096E+00 0.211909E-01 0.396440E-02 0.101775E-02 0.307366E-03

where it took the 0.522096E+00 0.211909E-01 as X & Y.

I suspect the ASCII reader is just a little too embracing. Maybe if it checked for two consecutive lines with pairs of numbers, or something like that? But tricky.

This should actually not happen. The whole point of the ASCII reader has ALWAYS been that it should cover 95% of cases by looking for X consecutive rows (where I think X is 6 now) with EXACTLY the same number of columns, containing numbers only. Not perfect, but as close as one can get.

In fact it appears that the reader is completely broken and reads any row that contains 2 or more numbers (at least if no text is on the line) as valid data, even if the number of columns varies (so it will take line 1 with 4 columns and line 2 with 2 and be happy).
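To make the symptom concrete, here is a minimal Python sketch of a reader that behaves the way described above. It is an illustration only, not the actual SasView code: the function name `lenient_read` is hypothetical, and it simply accepts every all-numeric line with two or more values, taking the first two as X and Y.

```python
def lenient_read(lines):
    """Mimic the reported symptom: accept any all-numeric line with 2+ values.

    Hypothetical illustration only; this is not the SasView reader.
    """
    points = []
    for line in lines:
        tokens = line.split()  # split on spaces and tabs
        if len(tokens) < 2:
            continue  # fewer than two values: ignored
        try:
            values = [float(tok) for tok in tokens]
        except ValueError:
            continue  # line contains text: ignored
        # No check on column count or on neighbouring lines.
        points.append((values[0], values[1]))
    return points


# The OTOKO header example from this ticket is read as one data point.
cf1 = ["201 1 1 1 0 0 0 0 0 0", r"C:\Temp\X2"]
print(lenient_read(cf1))  # [(201.0, 1.0)]

# Rows with different column counts are both accepted.
print(lenient_read(["1 2 3 4", "5 6"]))  # [(1.0, 2.0), (5.0, 6.0)]
```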

As Steve points out this bug seems to have existed as far back as 2.2.1. Looking at the code it clearly has the infrastructure and seems to be set to an X of 5, but somehow the code is seriously flawed or being partially bypassed.

Since this is purely an ASCII reader bug and unrelated to the folder reading issue of #614, I am creating a new ticket. I add here the files attached to that ticket as examples that get misread.

Attachments (5)

X27000.CF1 (98 bytes) - added by butler 3 years ago.
OTOKO format header file (ASCII)
extract.txt (98 bytes) - added by butler 3 years ago.
Corfunc parameter file (ASCII)
New Text Document.txt (6 bytes) - added by butler 3 years ago.
Made up bogus data file (ASCII)
AnOnerousExample.txt (163 bytes) - added by butler 3 years ago.
Adding lines to the corfunc parameter file
AnOnerousExample2.txt (170 bytes) - added by butler 3 years ago.
Another alteration of corfunc parameter file demonstrating the strange logic


Change History (10)

Changed 3 years ago by butler

OTOKO format header file (ASCII)

Changed 3 years ago by butler

Corfunc parameter file (ASCII)

Changed 3 years ago by butler

Made up bogus data file (ASCII)

Changed 3 years ago by butler

Adding lines to the corfunc parameter file

Changed 3 years ago by butler

Another alteration of corfunc parameter file demonstrating the strange logic

comment:1 Changed 3 years ago by piotr

Fixing the reader is not difficult given a clear set of expected rules.

  • do we want it to parse x,y only for consecutive lines of same token length?
  • do we want to drop the current x,y once an invalid line is found (as in AnOnerousExample2)?
  • do we want to use x,y only when there are > 1 (consecutive) lines with the same number of tokens?
  • any other requirements?

I agree the current implementation of the ASCII reader is pretty inefficient and should be at least partially rewritten, so let's do it properly.

comment:2 Changed 3 years ago by butler

  • Owner set to krzywon
  • Status changed from new to assigned

comment:3 Changed 3 years ago by krzywon

I have modified the reader locally and removed quite a bit of unused and unneeded code within it, trimming it by 120 lines. Before I push, I wanted to be sure what I am loading is the expected result.

Results for the files attached to this ticket:

AnOnerousExample.txt - 2 data points with no error values: X = [45.0,50.0], Y = [2.0,4.0]
AnOnerousExample2.txt - 1 data point with dy: X = [56.0], Y = [37.0], dY = [3.0]
extract.txt - 1 data point with dx and dy: X = [0.522096], Y = [0.0211909], dY = [0.0039644], dX = [0.00101775]
New Text Document.txt - 1 data point, no error: X = [4.0], Y = [5.0]
X27000.CF1 - 1 data point, no error: X = [201.0], Y = [1.0]

Results for Anton-Paar_PDH.txt in the test/upcoming_formats repo (does not have the .pdh extension, so it is loaded by the ASCII reader):

673 data points, dy, no dX
X_min = 0.06604:  X_max = 4.5017
Y_min = 0.005563176:  Y_max = 16.83994
dY_min = 0.002348514:  dY_max = 0.4175145

comment:4 Changed 3 years ago by butler

Actually for all the examples given they should ALL be "not a data file." The rules were: data must have 5 or more consecutive lines (I thought it was 6 but looking at the code briefly it seems it was 5) that contain
1) two or more numbers
2) ONLY numbers
3) EXACTLY THE SAME number of numbers

This last rule is predicated on the fact that any real data will be in a consistent format: whether there are 2, 3, 4, 6 or even 10 columns, all lines will have the same format.

The reader rejects everything before the first occurrence of 5 such consecutive lines. It then reads lines consecutively until it hits an EOF or a line which no longer meets the specs above. That line and anything after it would be considered footer information.

The reader was never designed to look for more than one data set in a generic ASCII file and probably shouldn't.
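
The rules above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the code that was pushed to SasView: the names `MIN_ROWS`, `parse_numeric_row` and `find_data_block` are hypothetical, lines are assumed to be space or tab delimited, and a blank line is treated as breaking a run.

```python
MIN_ROWS = 5  # threshold quoted in this comment; the name is hypothetical


def parse_numeric_row(line):
    """Return the line as a list of floats, or None if it is not a valid row.

    A valid row has two or more tokens and every token parses as a number.
    """
    tokens = line.split()  # assumes space or tab delimited values
    if len(tokens) < 2:
        return None
    try:
        return [float(tok) for tok in tokens]
    except ValueError:
        return None


def find_data_block(lines, min_rows=MIN_ROWS):
    """Return the first run of at least min_rows consecutive valid rows
    that all have the same number of columns, or [] if there is none."""
    run = []
    for line in lines:
        row = parse_numeric_row(line)
        if row is not None and (not run or len(row) == len(run[0])):
            run.append(row)
            continue
        # The current run is broken by this line.
        if len(run) >= min_rows:
            return run  # this line and everything after it is footer
        # Run too short to be data: discard it, restart on this line if valid.
        run = [row] if row is not None else []
    return run if len(run) >= min_rows else []


# The OTOKO header example from this ticket is rejected as "not a data file".
print(find_data_block(["201 1 1 1 0 0 0 0 0 0", r"C:\Temp\X2"]))  # []
```

Only the first qualifying block is returned, in line with the point that the reader should not look for more than one data set per file.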

comment:5 Changed 3 years ago by krzywon

  • Resolution set to fixed
  • Status changed from assigned to closed

With those criteria, I made a small change this morning and believe it is now working as expected. I pushed my changes so everyone can test.
