Wednesday, 7 October 2009

OCR SharePoint Style

Hi Folks,

How about a workflow in SharePoint with built-in OCR (Optical Character Recognition)?

There are a few solutions that provide this, but after a little research I realised that the Office 2007 applications come with OCR scanning out of the box. So I went in search of a way to code against this Office component.

It turns out that a nice little namespace (using .NET, of course) lets you run OCR on images. You must have the Office 2007 client installed on the server, which is a drawback, but considering how much you save compared with buying a separate solution, I believe it is worth it.

So we start off by creating a SharePoint workflow (state machine or sequential) in Visual Studio 2008. Associate it with the correct lists, add some actions, and when you get to the OCR scanning part of your code, reference the following DLLs:

DocumentFormat.OpenXml - This one is from the Open XML SDK from Microsoft.
MODI - This one is Microsoft Office Document Imaging, the scanning software that comes with Office 2003/2007.

Add the following using statements:

using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml;

Add a class-wide variable:

MODI.Document md;

* If you don't have MODI available, it's probably because you didn't install it when configuring the installation of Office. If you need it, go to Control Panel > Add/Remove Programs > Office 2007 > Change, then right-click the scanning software and select the "install now" option (not "install on first run"!).

Below is a function that will allow you to perform OCR on a TIFF.

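private void OCR(String Name)
{
    // The class-wide variable must be instantiated before use,
    // otherwise md.Create throws a NullReferenceException.
    md = new MODI.Document();
    md.Create(Name);

    // Note the casing: the enum in the MODI interop is MiLANGUAGES.
    md.OCR(MODI.MiLANGUAGES.miLANG_NORWEGIAN, true, true);

    string strText = String.Empty;

    // Only the first page (image) of the TIFF is read here.
    MODI.Image image = (MODI.Image)md.Images[0];
    MODI.Layout layout = image.Layout;

    // Build one space-separated string from the recognised words.
    for (int i = 0; i < layout.Words.Count; i++)
    {
        MODI.Word word = (MODI.Word)layout.Words[i];
        if (strText.Length > 0)
        {
            strText += " ";
        }
        strText += word.Text;
    }

    // Close without saving changes back to the file.
    md.Close(false);
    // do something with the words (strText) here...
}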
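
The word-joining loop in the function above could also be pulled out into a small helper, so the OCR call and the text handling stay separate. The following is only a sketch: OcrText and JoinWords are hypothetical names of my own, not part of MODI, and unlike the loop above it skips empty words so you never get double spaces.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of MODI or the Open XML SDK.
public static class OcrText
{
    // Joins the recognised words with single spaces.
    // Assumption: empty or null words are skipped.
    public static string JoinWords(IEnumerable<string> words)
    {
        var parts = new List<string>();
        foreach (string word in words)
        {
            if (!string.IsNullOrEmpty(word))
            {
                parts.Add(word);
            }
        }
        return string.Join(" ", parts.ToArray());
    }
}
```

To use it, collect each word.Text into a List<string> inside the loop and call OcrText.JoinWords(list) once at the end, instead of appending to strText on every pass.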

Update soon...
